Upgrading Talos and Kubernetes

Background

I have been operating a Kubernetes cluster for a few months now. This website runs on that cluster, which is made up of 8 nodes:

  • 3 control plane nodes
  • 3 worker nodes
  • 2 Longhorn storage nodes

The cluster runs a mix of arm64 and amd64 nodes, which means all container images must be multi-architecture. It is deployed on Talos, a modern, API-driven Linux distribution designed specifically for Kubernetes, with no traditional SSH access; this approach keeps the infrastructure secure and streamlined, managed entirely through the API. Inspired by the Datavirke blog, the setup follows a similar design but incorporates several unique elements.

As a strong proponent of GitOps, I store the entire configuration of the cluster in Git and deploy it from there. There's a lot more to this approach, which I plan to cover in future posts. For now, here’s an overview of the core technologies being used in this setup:

  • Gitea for hosting source code and package registries.
  • Talos as the operating system.
  • KubeSpan for wireguard mesh networking.
  • Talhelper for storing machine configurations in Git.
  • FluxCD for continuous integration and cluster synchronization.
  • SOPS to encrypt secrets in Git.
  • Cilium as the container network interface (CNI).
  • RenovateBot to automate dependency updates.
  • Cloudflare Zero Trust for ingress, pairing ingress-nginx with cloudflared deployments. External traffic reaches the cluster through Cloudflare tunnels, so no inbound firewall ports need to be opened, which helps reduce costs (a rough sketch of the traffic path follows this list).
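
The rough shape of that ingress path, sketched with purely illustrative tunnel, host, and service names (the real setup drives this through the cloudflared deployment's configuration rather than ad-hoc commands):

# Illustrative sketch only; tunnel name, hostname, and service address are placeholders.
cloudflared tunnel login                                         # authenticate against the Cloudflare account
cloudflared tunnel create homelab-ingress                        # create a named tunnel
cloudflared tunnel route dns homelab-ingress blog.example.com    # point a DNS record at the tunnel
# The in-cluster cloudflared deployment runs the tunnel and forwards traffic to ingress-nginx,
# so no inbound firewall ports are ever opened:
cloudflared tunnel run --url http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80 homelab-ingress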

Upgrading Talos

I was running Talos v1.7.5 and Kubernetes v1.30.2 when I noticed new releases: Talos v1.8.0 and Kubernetes v1.31.1. Given the redundancy in my multi-node cluster, I was confident that even if issues arose during the upgrade, my website and other services would remain operational. I followed a staged upgrade approach:

  • Upgrade one control plane node first.
  • Upgrade the remaining two control plane nodes.
  • Upgrade one worker node.
  • Upgrade the remaining worker nodes.

I used the following command to generate the node-specific upgrade commands from my cluster configuration:

talhelper gencommand upgrade

This produced upgrade steps tailored to my setup, ensuring that I didn't overlook anything, particularly since I am using custom builds. I copied the generated commands into a script and started upgrading the first node. The process was smooth, and within an hour, my entire cluster was running Talos v1.8.0. Once the cluster upgrade was completed, I also updated my local talosctl and talhelper binaries.
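
For reference, the generated script is roughly of the following shape; the node IPs and the image schematic ID below are placeholders, not the real values talhelper fills in from talconfig.yaml:

# One talosctl upgrade per node, run in the staged order described above.
talosctl upgrade --nodes 10.0.0.11 --image factory.talos.dev/installer/<schematic-id>:v1.8.0   # first control plane node
talosctl upgrade --nodes 10.0.0.12 --image factory.talos.dev/installer/<schematic-id>:v1.8.0   # remaining control plane nodes
talosctl upgrade --nodes 10.0.0.13 --image factory.talos.dev/installer/<schematic-id>:v1.8.0
talosctl upgrade --nodes 10.0.0.21 --image factory.talos.dev/installer/<schematic-id>:v1.8.0   # then the workers, one at a time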

Upgrading Kubernetes

After upgrading Talos, the next step was upgrading Kubernetes, again with a single talhelper-generated command:

talhelper gencommand upgrade-k8s

This command acts as a simple wrapper for talosctl upgrade-k8s. It only requires one node to initiate the upgrade process, after which it uses Talos discovery to locate all other nodes and upgrade them in sequence. The upgrade to Kubernetes v1.31.1 took approximately 20-30 minutes to complete for the entire cluster, and the process was extremely smooth and painless.
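
The generated command boils down to something like this, where the node IP is a placeholder for one of the control plane nodes:

talosctl --nodes 10.0.0.11 upgrade-k8s --to 1.31.1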

Post-upgrade experience

One of the most impressive aspects of the upgrade was that none of the applications running in the cluster experienced any downtime. However, after the upgrade, FluxCD stopped functioning. Upon investigation, I discovered that Flux v2.3.0 did not support Kubernetes v1.31.1; fortunately, a v2.4.0 release was scheduled for the end of September. I waited a few days for that release and then performed the following steps to upgrade Flux to v2.4.0:

flux install --export > ./clusters/my-cluster/flux-system/gotk-components.yaml
kubectl apply --server-side --force-conflicts -f ./clusters/my-cluster/flux-system/gotk-components.yaml
flux reconcile ks flux-system --with-source

After upgrading Flux, everything started operating smoothly again.
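
For a quick sanity check after an upgrade like this, a couple of read-only Flux commands are enough to confirm that the controllers and the Kustomizations are healthy again:

flux check                     # verifies prerequisites and that all controllers are healthy
flux get kustomizations -A     # every Kustomization should report Ready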

Conclusion

Upgrading both Talos and Kubernetes was a smooth experience overall. However, a key takeaway from this process is the importance of verifying the compatibility of all components, such as FluxCD, before proceeding with a Kubernetes upgrade. In my case, it would have been better to wait for FluxCD v2.4.0 and avoid the temporary loss of GitOps reconciliation, but since the risk was acceptable, everything worked out fine in the end.