Install

The order in which the machines come up matters. The boot order is: bastion, k3s control node, then k3s agents. I strongly recommend letting the control node come up fully before you start on the agents.

tl;dr

  1. Update your configuration (cluster_config) file to match your desired cluster state.
    1. Currently this only supports defining a single bastion node.
    2. k3s nodes must be defined as <ip>|<hostname>|<agent|primary>. The third field defines whether the node belongs to the control plane (primary) or is an agent (worker). An HA control plane isn’t currently supported.
  2. Run ./generate_k3_nodes.sh
    1. This will generate a series of files in generated_assets organized by type and hostname.
  3. Manually inspect each generated user-data file to make sure no values were left blank.
  4. Flash Ubuntu 20.10 onto your USB drive.
  5. Copy the contents of the generated user-data file onto the flash drive.
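
The node entry format from step 1 can be checked before running the generator. The following is a minimal sketch and not part of the repository: validate_node_entry is a hypothetical helper name, and it assumes only the <ip>|<hostname>|<agent|primary> layout described above. It checks the shape of the IP address, not the range of each octet.

```shell
# Hypothetical helper (not part of the repository): checks one node entry
# of the form <ip>|<hostname>|<agent|primary>.
# Returns 0 if the entry looks valid; otherwise prints a reason to stderr
# and returns 1.
validate_node_entry() {
    # Split the entry on '|'. Anything after a third '|' lands in $extra.
    IFS='|' read -r ip host role extra <<EOF
$1
EOF
    if [ -n "$extra" ]; then
        echo "too many fields: $1" >&2
        return 1
    fi
    # Shape check only: four dot-separated groups of 1-3 digits.
    if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "bad ip: $ip" >&2
        return 1
    fi
    # Letters, digits and hyphens, starting with a letter or digit.
    if ! printf '%s\n' "$host" | grep -Eq '^[A-Za-z0-9][A-Za-z0-9-]*$'; then
        echo "bad hostname: $host" >&2
        return 1
    fi
    case $role in
        agent|primary) return 0 ;;
        *) echo "bad role: $role" >&2; return 1 ;;
    esac
}
```

For example, validate_node_entry '10.0.0.20|aegaeon-c001|primary' returns 0, while an entry whose third field is worker prints a reason to stderr and returns 1.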

Verification & Troubleshooting

Each node can take a while to come up, depending on how big the initial updates are. Once they start coming up, log in to your control node and run kubectl get nodes -o wide. You should see output like this:

NAME           STATUS   ROLES                  AGE     VERSION        INTERNAL-IP   EXTERNAL-IP
aegaeon-a006   Ready    <none>                 11h     v1.22.6+k3s1   10.0.0.35     <none>
aegaeon-a001   Ready    <none>                 21h     v1.22.6+k3s1   10.0.0.30     <none>
aegaeon-a002   Ready    <none>                 11h     v1.22.6+k3s1   10.0.0.31     <none>
aegaeon-c001   Ready    control-plane,master   21h     v1.22.6+k3s1   10.0.0.20     <none>
aegaeon-a005   Ready    <none>                 10h     v1.22.6+k3s1   10.0.0.34     <none>
aegaeon-a003   Ready    <none>                 10h     v1.22.6+k3s1   10.0.0.32     <none>
aegaeon-a004   Ready    <none>                 7h55m   v1.22.6+k3s1   10.0.0.33     <none>
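
With more than a handful of nodes it can help to print only the ones that are not ready. The following is a small sketch; not_ready_nodes is a hypothetical helper name, and it assumes the column layout shown above (NAME first, STATUS second, one header line). A status such as Ready,SchedulingDisabled is also reported, since it is not exactly Ready.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    # NR > 1 skips the header line; $1 is NAME, $2 is STATUS.
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Usage on the control node: kubectl get nodes -o wide | not_ready_nodes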

Problems I Ran Into

If the status is NotReady for an extended period of time, log in to the node using its INTERNAL-IP and run cat /var/log/syslog | grep k3s. Read the log entries to get an idea of what’s going on. If you can’t find something quickly, you’ve got two nuclear options: manually run the updates/installs using the steps in the user-data file, or just reflash the drive/user-data and let it start again.
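
On a node that has been up for a while, the syslog can be long, so it may be easier to look at only the most recent k3s entries. A minimal sketch follows; recent_k3s_logs is a hypothetical helper name, and the defaults (the /var/log/syslog path from above and 20 lines) are assumptions you can override.

```shell
# Hypothetical helper: print the last N lines that mention k3s in a log file.
#   $1 = log file (default: /var/log/syslog)
#   $2 = number of lines to show (default: 20)
recent_k3s_logs() {
    log_file=${1:-/var/log/syslog}
    count=${2:-20}
    grep -- 'k3s' "$log_file" | tail -n "$count"
}
```

Usage on the failing node: recent_k3s_logs /var/log/syslog 50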

flannel exited : operation not supported

This is caused by a missing vxlan kernel module. Verify by running sudo modprobe vxlan, which fails if the module is not installed.

Remedy with sudo apt install linux-modules-extra-raspi && sudo reboot.

Node Password Rejected

This one was a little harder to track down. If you see this message in the log, run the following on the failing node:

  • k3s-agent-uninstall.sh
  • sudo rm -f /etc/rancher/node/password

Log in to the primary node and run:

  • cat /var/lib/rancher/k3s/server/cred/node-passwd
  • Check for any lines that match the failing node’s hostname and delete them from the file.
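
Deleting the lines by hand works, but it can also be scripted. The sketch below is an assumption-heavy illustration: remove_node_password is a hypothetical helper name, and it assumes the file holds one comma-separated line per node with the node name as one of the fields. Check the layout of your own file first. It keeps a .bak copy next to the original.

```shell
# Hypothetical helper: drop every line of a node-passwd style file in which
# any comma-separated field equals the given node name.
#   $1 = path to the file, $2 = node name to remove
# A backup of the original is kept as <file>.bak.
remove_node_password() {
    passwd_file=$1
    node_name=$2
    cp -- "$passwd_file" "$passwd_file.bak" || return 1
    # Skip a line as soon as one field matches; print all other lines.
    awk -F',' -v name="$node_name" '
        { for (i = 1; i <= NF; i++) if ($i == name) next; print }
    ' "$passwd_file.bak" > "$passwd_file"
}
```

Run it from a root shell on the primary node, for example: remove_node_password /var/lib/rancher/k3s/server/cred/node-passwd aegaeon-a002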

Log back in to the failing node and run:

  • sudo su
  • Re-run the command from the user-data file that installs k3s (it will be in the runcmd section).

user-data explanation

The only major difference between the agent and control node is the arguments passed to the k3s command for initial setup.

The user-data file performs the following actions:

  • update and upgrade Ubuntu packages to their current versions
  • set the computer host name
  • set up the user aegaeon
  • set up the public key for the computer
  • install and configure docker
  • install and configure k3s
  • copy private keys
  • copy global ssh_config
  • copy various shell scripts

#cloud-config
# See cloud-init documentation for available options:
# https://cloudinit.readthedocs.io/

package_update: true
package_upgrade: true
packages:
    - docker.io
    - net-tools
    - linux-modules-extra-raspi

hostname: ${NODE_HOSTNAME}

ssh_pwauth: false

groups:
  - ubuntu: [root, sys]

users:
  - default
  - name: aegaeon
    gecos: aegaeon
    sudo: ALL=(ALL) NOPASSWD:ALL
    groups: sudo
    ssh_import_id: None
    lock_passwd: true
    shell: /bin/bash
    ssh_authorized_keys:
      - ${K3_NODE_PUBLIC_KEY}

runcmd:
    - sed -i '$ s/$/ cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1 swapaccount=1/' /boot/firmware/cmdline.txt
    - curl -sfL https://get.k3s.io | K3S_TOKEN=${K3S_CLUSTER_TOKEN} K3S_KUBECONFIG_MODE="644" sh -s - server --disable servicelb --disable traefik
    # The line below is for booting an agent node on the network instead
    # - curl -sfL https://get.k3s.io | K3S_TOKEN=${K3S_CLUSTER_TOKEN} K3S_URL=https://${PRIMARY_K3S_IP}:6443 sh -

power_state:
  mode: reboot
  timeout: 60
  condition: True
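
The manual inspection for blank spots can be partly automated. The following is a rough sketch under stated assumptions: check_user_data is a hypothetical helper name, and the patterns are guesses based on the template above (unreplaced ${VAR} placeholders, an empty hostname value, an empty K3S_TOKEN, a K3S_URL with no host, and empty list items). It does not replace reading the file.

```shell
# Hypothetical helper: print suspicious lines (with line numbers) from a
# generated user-data file. Returns 1 if anything was flagged, 0 otherwise.
# Flags:
#   - unreplaced ${VAR} placeholders
#   - "hostname:" with no value
#   - "K3S_TOKEN=" with no value
#   - "https://:" (URL with an empty host)
#   - a list item that is just "-"
check_user_data() {
    if grep -nE '\$\{[A-Za-z_][A-Za-z0-9_]*\}|^hostname:[[:space:]]*$|K3S_TOKEN=([[:space:]]|$)|https://:|^[[:space:]]*-[[:space:]]*$' "$1"; then
        return 1
    fi
    return 0
}
```

Usage: check_user_data path/to/user-data, where the path is wherever the generator wrote the file for that node.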