The order in which computers come up matters. The install boot order is: bastion, k3s control nodes, then k3s agents. I strongly recommend you let the control node come up before you start the agent work.
Update the cluster_config file to match your desired cluster state. Each line uses the format <ip>|<hostname>|<agent|primary>; the third field defines whether the node belongs to the control plane or is an agent (worker). An HA control plane isn't currently supported.
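For illustration, a cluster_config matching the hostnames and IPs from the kubectl output shown later might look like this (the addresses are examples for your own network):

10.0.0.20|aegaeon-c001|primary
10.0.0.30|aegaeon-a001|agent
10.0.0.31|aegaeon-a002|agent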
Run ./generate_k3_nodes.sh to create the per-node assets in generated_assets, organized by type and hostname. Flash Ubuntu 20.10 onto your USB drive, then replace the default user-data file on the flash drive with the generated one for that node, as sketched below.
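A sketch of that copy step, assuming the image's boot partition mounts at /media/$USER/system-boot and that the generated assets sit in a per-hostname directory (adjust both paths to your actual layout):

cp generated_assets/aegaeon-c001/user-data /media/$USER/system-boot/user-data
sync && umount /media/$USER/system-boot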
Each node can take a while to come up, depending on how big the initial updates are, but once nodes start coming up, log in to your control node and run kubectl get nodes -o wide.
You should see output like this:
NAME           STATUS   ROLES                  AGE     VERSION        INTERNAL-IP   EXTERNAL-IP
aegaeon-a006   Ready    <none>                 11h     v1.22.6+k3s1   10.0.0.35     <none>
aegaeon-a001   Ready    <none>                 21h     v1.22.6+k3s1   10.0.0.30     <none>
aegaeon-a002   Ready    <none>                 11h     v1.22.6+k3s1   10.0.0.31     <none>
aegaeon-c001   Ready    control-plane,master   21h     v1.22.6+k3s1   10.0.0.20    <none>
aegaeon-a005   Ready    <none>                 10h     v1.22.6+k3s1   10.0.0.34     <none>
aegaeon-a003   Ready    <none>                 10h     v1.22.6+k3s1   10.0.0.32     <none>
aegaeon-a004   Ready    <none>                 7h55m   v1.22.6+k3s1   10.0.0.33     <none>
If the status stays NotReady for an extended period of time, log into the node at its INTERNAL-IP, run cat /var/log/syslog | grep k3s, and read the notes below for ideas on what's going on. If you can't find the problem quickly, you've got two nuclear options: manually run the updates/installs with the steps in the user-data file, or just reflash the drive/user-data and let it start again.
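Before going nuclear, a minimal triage pass from the failing node (assuming it's an agent; on the control node the systemd service is named k3s rather than k3s-agent):

sudo systemctl status k3s-agent        # is the service running at all?
grep k3s /var/log/syslog | tail -n 50  # the most recent k3s log lines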
One failure mode is a missing vxlan kernel module (verify by running modprobe vxlan). Remedy with sudo apt install linux-modules-extra-raspi && reboot.
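After the reboot, modprobe should exit silently; a quick confirmation:

sudo modprobe vxlan && lsmod | grep vxlan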
This one was a little harder to track down, but if the log complains about the node password (a mismatch between the node and the server's node-passwd entry), run the following on the failing node:
k3s-agent-uninstall.sh
sudo rm -f /etc/rancher/node/password
Log onto the primary node and run:
cat /var/lib/rancher/k3s/server/cred/node-passwd
Check for any lines that match the failing node's name and delete them from the file. Log back into the failing node and run:
sudo su
Then re-run the command from the user-data file that installs k3s (it will be in the runcmd section). The only major difference between the agent and control node is the arguments passed to the k3s command for initial setup.
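Putting the whole recovery together, a sketch assuming the failing node is aegaeon-a003 and using sed instead of hand-editing node-passwd (the hostname and the sed shortcut are my own additions; the install line is the agent command from the user-data below):

# On the failing node:
sudo k3s-agent-uninstall.sh
sudo rm -f /etc/rancher/node/password
# On the primary node, drop the stale credential line:
sudo sed -i '/aegaeon-a003/d' /var/lib/rancher/k3s/server/cred/node-passwd
# Back on the failing node, re-run the agent install from the user-data runcmd:
curl -sfL https://get.k3s.io | K3S_TOKEN=${K3S_CLUSTER_TOKEN} K3S_URL=https://${PRIMARY_K3S_IP}:6443 sh -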
The user-data file will perform the following actions:
- update and upgrade all packages, and install docker.io, net-tools, and linux-modules-extra-raspi
- set the node's hostname
- create a passwordless-sudo user named aegaeon, with login locked to the generated public key (the same key referenced in your ssh_config)
- append the cgroup options k3s needs to /boot/firmware/cmdline.txt
- install k3s as a server or agent, then reboot
#cloud-config
# See cloud-init documentation for available options:
# https://cloudinit.readthedocs.io/
package_update: true
package_upgrade: true
packages:
  - docker.io
  - net-tools
  - linux-modules-extra-raspi
hostname: ${NODE_HOSTNAME}
ssh_pwauth: false
groups:
  - ubuntu: [root, sys]
users:
  - default
  - name: aegaeon
    gecos: aegaeon
    sudo: ALL=(ALL) NOPASSWD:ALL
    groups: sudo
    ssh_import_id: None
    lock_passwd: true
    shell: /bin/bash
    ssh_authorized_keys:
      - ${K3_NODE_PUBLIC_KEY}
runcmd:
  - sed -i '$ s/$/ cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1 swapaccount=1/' /boot/firmware/cmdline.txt
  - curl -sfL https://get.k3s.io | K3S_TOKEN=${K3S_CLUSTER_TOKEN} K3S_KUBECONFIG_MODE="644" sh -s - server --disable servicelb --no-deploy traefik
  # The below line is for booting an agent node on the network
  # - curl -sfL https://get.k3s.io | K3S_TOKEN=${K3S_CLUSTER_TOKEN} K3S_URL=https://${PRIMARY_K3S_IP}:6443 sh -
power_state:
  mode: reboot
  timeout: 60
  condition: True
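Because the server is installed with K3S_KUBECONFIG_MODE="644", its kubeconfig at /etc/rancher/k3s/k3s.yaml is world-readable. A sketch for driving the cluster from your bastion instead of the control node (10.0.0.20 is the control node IP from the examples above; k3s writes the kubeconfig pointing at 127.0.0.1, so the server address needs rewriting):

scp aegaeon@10.0.0.20:/etc/rancher/k3s/k3s.yaml ~/.kube/config
sed -i 's/127.0.0.1/10.0.0.20/' ~/.kube/config
kubectl get nodes -o wide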