Kubernetes the not so hard way with Ansible - The worker - (K8s v1.28)

This post is based on Kelsey Hightower’s Kubernetes The Hard Way - Bootstrapping Kubernetes Workers.

It makes sense to use a recent Linux kernel in general. Container runtimes like containerd and also Cilium (which comes later) profit a lot from a recent kernel. I recommend using a kernel >= 5.4 if possible. Ubuntu 20.04 already uses kernel 5.4 by default (which includes the WireGuard module, by the way). It also provides kernel 5.15, e.g. via the linux-image-5.15.0-83-generic package or via the Hardware Enablement Stack (HWE) package linux-generic-hwe-20.04, which contains kernel 5.15 or even newer kernels. As of writing this blog post kernel 6.5 is already available for Ubuntu 22.04 (that’s the one I use, installed via the linux-generic-hwe-22.04-edge package).
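If you want to check the kernel recommendation on a node programmatically, a minimal sketch could look like the following. The function names are my own and only the major and minor version numbers are compared, so treat it as an illustration rather than a complete check.

```python
import re


def kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string like '6.5.0-14-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_recent(release: str, minimum: tuple[int, int] = (5, 4)) -> bool:
    """Return True if the kernel is at least the recommended minimum (5.4)."""
    return kernel_version(release) >= minimum


# Sample release strings taken from the kernels mentioned in this post.
for sample in ("5.4.0-100-generic", "5.15.0-83-generic", "6.5.0-14-generic"):
    print(sample, kernel_is_recent(sample))
```

On a real node the release string could come from platform.release() or from uname -r.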

Before containerd, a lot of Kubernetes installations most probably used Docker as container runtime. But Docker/dockershim was deprecated with Kubernetes v1.20 and removed with Kubernetes v1.24. Behind the scenes Docker already used containerd, so in the end Docker was just an additional “layer” that is no longer needed for Kubernetes. containerd together with runc is, so to say, a replacement for Docker. I’ve written a blog post on how to migrate from Docker/dockershim to containerd: Kubernetes: Replace dockershim with containerd and runc.

A container runtime is needed to execute workloads that you deploy to Kubernetes. A workload is normally a Docker container image (which you build locally, on a Jenkins server or whatever build pipeline you have in place) which runs a webserver or any other service that listens on a port.

To make containerd work, runc and CNI plugins are needed. runc is a CLI tool for spawning and running containers on Linux according to the OCI specification. CNI, a Cloud Native Computing Foundation project, consists of a specification and libraries for writing plugins to configure network interfaces in Linux containers, along with a number of supported plugins. CNI concerns itself only with network connectivity of containers and removing allocated resources when the container is deleted.

I’ve prepared Ansible roles for both runc and the CNI plugins. The defaults of these two roles should be reasonable. I’ll just override one default setting in group_vars/k8s_worker.yml:

cni_tmp_directory: "/opt/tmp/cni"

So let’s install these two roles:

ansible-galaxy install githubixx.runc
ansible-galaxy install githubixx.cni

Next I’ll install containerd, which is a (kind of) modern replacement for Docker, with the help of my Ansible role for containerd. containerd is a container runtime which will be installed on each Kubernetes worker node in the cluster so that Pods (basically the workload distributed as container images) can run there.

So first install the Ansible role for containerd:

ansible-galaxy install githubixx.containerd

In general the default variables of this role should be just fine. Just make sure that you also adjust BinaryName in containerd_config if you changed runc_bin_directory.

For all variables the containerd role offers please see default.yml.

As containerd is relevant for the K8s worker nodes I’ll override two default variables in group_vars/k8s_worker.yml. E.g.:

containerd_tmp_directory: "/opt/tmp/containerd"
containerd_binary_directory: "/usr/local/sbin"

Also add the roles (runc, cni and containerd) to our playbook file k8s.yml e.g.:

-
  hosts: k8s_worker
  roles:
    -
      role: githubixx.cni
      tags: role-cni
    -
      role: githubixx.runc
      tags: role-runc
    -
      role: githubixx.containerd
      tags: role-containerd

If everything is in place the roles can be deployed on all worker nodes (which also includes the controller nodes as I already mentioned previously as they need Cilium running which is deployed as Pods on every node - so worker and controller hosts):

ansible-playbook --tags=role-runc k8s.yml
ansible-playbook --tags=role-cni k8s.yml
ansible-playbook --tags=role-containerd k8s.yml

In Kubernetes control plane I installed kube-apiserver, kube-scheduler and kube-controller-manager on the controller nodes. For the worker nodes I’ve also prepared an Ansible role which installs the Kubernetes worker components. The Kubernetes part of a worker node needs a kubelet and a kube-proxy daemon. The workers do the “real” work: they run the Pods (which are containers deployed via container images). So in production, and if you do real work, it won’t hurt to choose bigger iron for the worker hosts 😉

kubelet is responsible for creating a Pod/container on a worker node once the scheduler has chosen that node to run the Pod. kube-proxy takes care of routing: e.g. if a Pod or a Service is added, kube-proxy updates the routing rules using iptables (the Kubernetes default) or, on newer Kubernetes installations, IPVS (which is the default in my roles).

The worker depends on the infrastructure that I installed in the control plane blog post. The role provides the following variables:

# The base directory for Kubernetes configuration and certificate files for
# everything worker nodes related. After the playbook is done this directory
# contains various sub-folders.
k8s_worker_conf_dir: "/etc/kubernetes/worker"

# All certificate files (Public Key Infrastructure related) specified in
# "k8s_worker_certificates" (see "vars/main.yml") will be stored here.
# Owner and group of this new directory will be "root". File permissions
# will be "0640".
k8s_worker_pki_dir: "{{ k8s_worker_conf_dir }}/pki"

# The directory to store the Kubernetes binaries (see "k8s_worker_binaries"
# variable in "vars/main.yml"). Owner and group of this new directory
# will be "root" in both cases. Permissions for this directory will be "0755".
#
# NOTE: The default directory "/usr/local/bin" normally already exists on every
# Linux installation with the owner, group and permissions mentioned above. If
# your current settings are different consider a different directory. But make sure
# that the new directory is included in your "$PATH" variable value.
k8s_worker_bin_dir: "/usr/local/bin"

# K8s release
k8s_worker_release: "1.28.5"

# The interface on which the Kubernetes services should listen on. As all cluster
# communication should use a VPN interface the interface name is
# normally "wg0" (WireGuard),"peervpn0" (PeerVPN) or "tap0".
#
# The network interface on which the Kubernetes worker services should
# listen on. That is:
#
# - kube-proxy
# - kubelet
#
k8s_interface: "eth0"

# The directory from where to copy the K8s certificates. By default this
# will expand to the user's LOCAL $HOME (the user that runs "ansible-playbook ...")
# plus "/k8s/certs". That means if the user's $HOME directory is e.g.
# "/home/da_user" then "k8s_ca_conf_directory" will have a value of
# "/home/da_user/k8s/certs".
k8s_ca_conf_directory: "{{ '~/k8s/certs' | expanduser }}"

# The IP address or hostname of the Kubernetes API endpoint. This variable
# is used by "kube-proxy" and "kubelet" to connect to the "kube-apiserver"
# (Kubernetes API server).
#
# By default the first host in the Ansible group "k8s_controller" is
# specified here. NOTE: This setting is not fault tolerant! That means
# if the first host in the Ansible group "k8s_controller" is down
# the worker node and its workload continue working but the worker
# node doesn't receive any updates from Kubernetes API server.
#
# If you have a loadbalancer that distributes traffic between all
# Kubernetes API servers it should be specified here (either its IP
# address or the DNS name). But you need to make sure that the IP
# address or the DNS name you want to use here is included in the
# Kubernetes API server TLS certificate (see "k8s_apiserver_cert_hosts"
# variable of https://github.com/githubixx/ansible-role-kubernetes-ca
# role). If it's not specified you'll get certificate errors in the
# logs of the services mentioned above.
k8s_worker_api_endpoint_host: "{% set controller_host = groups['k8s_controller'][0] %}{{ hostvars[controller_host]['ansible_' + hostvars[controller_host]['k8s_interface']].ipv4.address }}"

# As above just for the port. It specifies on which port the
# Kubernetes API servers are listening. Again if there is a loadbalancer
# in place that distributes the requests to the Kubernetes API servers
# put the port of the loadbalancer here.
k8s_worker_api_endpoint_port: "6443"

# OS packages needed on a Kubernetes worker node. You can add additional
# packages at any time. But please be aware if you remove one or more from
# the default list your worker node might not work as expected or doesn't work
# at all.
k8s_worker_os_packages:
  - ebtables
  - ethtool
  - ipset
  - conntrack
  - iptables
  - iptstate
  - netstat-nat
  - socat
  - netbase

# Directory to store kubelet configuration
k8s_worker_kubelet_conf_dir: "{{ k8s_worker_conf_dir }}/kubelet"

# kubelet settings
#
# If you want to enable the use of "RuntimeDefault" as the default seccomp
# profile for all workloads add these settings to "k8s_worker_kubelet_settings":
#
# "seccomp-default": ""
#
# Also see:
# https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads
k8s_worker_kubelet_settings:
  "config": "{{ k8s_worker_kubelet_conf_dir }}/kubelet-config.yaml"
  "node-ip": "{{ hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address }}"
  "kubeconfig": "{{ k8s_worker_kubelet_conf_dir }}/kubeconfig"

# kubelet configuration (KubeletConfiguration)
k8s_worker_kubelet_conf_yaml: |
  kind: KubeletConfiguration
  apiVersion: kubelet.config.k8s.io/v1beta1
  address: {{ hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address }}
  authentication:
    anonymous:
      enabled: false
    webhook:
      enabled: true
    x509:
      clientCAFile: "{{ k8s_worker_pki_dir }}/ca-k8s-apiserver.pem"
  authorization:
    mode: Webhook
  clusterDomain: "cluster.local"
  clusterDNS:
    - "10.32.0.254"
  failSwapOn: true
  healthzBindAddress: "{{ hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address }}"
  healthzPort: 10248
  runtimeRequestTimeout: "15m"
  serializeImagePulls: false
  tlsCertFile: "{{ k8s_worker_pki_dir }}/cert-{{ inventory_hostname }}.pem"
  tlsPrivateKeyFile: "{{ k8s_worker_pki_dir }}/cert-{{ inventory_hostname }}-key.pem"
  cgroupDriver: "systemd"
  registerNode: true
  containerRuntimeEndpoint: "unix:///run/containerd/containerd.sock"  

# Directory to store kube-proxy configuration
k8s_worker_kubeproxy_conf_dir: "{{ k8s_worker_conf_dir }}/kube-proxy"

# kube-proxy settings
k8s_worker_kubeproxy_settings:
  "config": "{{ k8s_worker_kubeproxy_conf_dir }}/kubeproxy-config.yaml"

k8s_worker_kubeproxy_conf_yaml: |
  kind: KubeProxyConfiguration
  apiVersion: kubeproxy.config.k8s.io/v1alpha1
  bindAddress: {{ hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address }}
  clientConnection:
    kubeconfig: "{{ k8s_worker_kubeproxy_conf_dir }}/kubeconfig"
  healthzBindAddress: {{ hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address }}:10256
  mode: "ipvs"
  ipvs:
    minSyncPeriod: 0s
    scheduler: ""
    syncPeriod: 2s
  iptables:
    masqueradeAll: true
  clusterCIDR: "10.200.0.0/16"  

Make sure that k8s_interface: "wg0" is set if you use WireGuard. It should already be set in group_vars/all.yml because it was also used by the Control Plane nodes.

I’d also recommend extending k8s_worker_kubelet_settings by one setting: "seccomp-default": "". This enables the use of “RuntimeDefault” as the default seccomp profile for all workloads. In short: this feature blocks quite a few system calls, e.g. reboot. There is no need for a container to reboot a Kubernetes host 😉 So system calls that are relevant for “normal” workloads are still allowed while those that aren’t are blocked. This feature is stable since Kubernetes v1.27. For more information see my Kubernetes upgrade notes: Enable default seccomp profile. Also see: Enable the use of RuntimeDefault as the default seccomp profile for all workloads. To enable seccomp-default I’ll change the k8s_worker_kubelet_settings variable in group_vars/k8s_worker.yml accordingly by adding "seccomp-default": "":

k8s_worker_kubelet_settings:
  "config": "{{ k8s_worker_kubelet_conf_dir }}/kubelet-config.yaml"
  "node-ip": "{{ hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address }}"
  "kubeconfig": "{{ k8s_worker_kubelet_conf_dir }}/kubeconfig"
  "seccomp-default": ""

The role will search for the certificates I created in K8s certificate authority blog post in the directory specified in k8s_ca_conf_directory on my Ansible Controller node. The files used here are listed in k8s_worker_certificates (see vars/main.yml).

The Kubernetes worker binaries needed are listed in k8s_worker_binaries (also defined in vars/main.yml).
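To make the k8s_worker_kubelet_settings dictionary from above a bit more tangible: each key most likely ends up as a command line flag of the kubelet binary. The following sketch shows the idea. Note that the rendering rule (an empty value becomes a bare flag like --seccomp-default) is my assumption for illustration and render_flags is a hypothetical helper, not code from the role.

```python
def render_flags(settings: dict[str, str]) -> list[str]:
    """Turn a settings dictionary into a list of command line flags.

    Assumption: a key with an empty value is rendered as a bare flag,
    every other key as --key=value.
    """
    flags = []
    for key, value in settings.items():
        if value:
            flags.append(f"--{key}={value}")
        else:
            flags.append(f"--{key}")
    return flags


# Example values with the Jinja2 expressions already resolved by hand.
kubelet_settings = {
    "config": "/etc/kubernetes/worker/kubelet/kubelet-config.yaml",
    "node-ip": "10.0.11.3",
    "kubeconfig": "/etc/kubernetes/worker/kubelet/kubeconfig",
    "seccomp-default": "",
}

print(" ".join(["kubelet"] + render_flags(kubelet_settings)))
```

Keeping the flags in a dictionary makes it easy to override or extend single settings in group_vars without copying a whole command line.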

kubelet service can use CNI (the Container Network Interface) to manage machine level networking requirements. The CNI plugins needed were installed with the cni role which was already mentioned above.

As you might remember I’ve installed HAProxy in the previous blog post. It was installed on all Kubernetes Controller and Worker nodes. kubelet and kube-proxy should also use HAProxy to connect to kube-apiserver for higher availability. So I’ll set the following variables in group_vars/k8s_worker.yml:

k8s_worker_api_endpoint_host: "127.0.0.1"
k8s_worker_api_endpoint_port: "16443"

Now I add an entry for the worker hosts (which also includes the controller nodes as mentioned above) to Ansible’s hosts file. E.g.:

k8s_worker:
  hosts:
    k8s-01[01:03]02.i.example.com:
    k8s-01[01:03]03.i.example.com:
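In case the range syntax in the inventory is new to you: a pattern like [01:03] is expanded by Ansible into one host per number, with the leading zero preserved. The sketch below mimics this for plain numeric ranges only (Ansible itself supports more, e.g. alphabetic ranges). The expand function is my own illustration, not Ansible code.

```python
import re

# Matches a numeric range like [01:03].
RANGE = re.compile(r"\[(\d+):(\d+)\]")


def expand(pattern: str) -> list[str]:
    """Expand numeric [start:end] ranges (inclusive) in a host pattern."""
    match = RANGE.search(pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep leading zeros, e.g. 01, 02, 03
    hosts = []
    for number in range(int(start), int(end) + 1):
        candidate = (
            pattern[: match.start()] + str(number).zfill(width) + pattern[match.end():]
        )
        # Recurse in case the pattern contains more than one range.
        hosts.extend(expand(candidate))
    return hosts


for host in expand("k8s-01[01:03]02.i.example.com"):
    print(host)
```

So the two patterns above describe six hosts in total: three controller nodes (ending in 02) and three worker nodes (ending in 03).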

Then I install the role via

ansible-galaxy install githubixx.kubernetes_worker

Next I add the role to k8s.yml by extending the roles list of the k8s_worker hosts. E.g.:

  hosts: k8s_worker
  roles:
    -
      role: githubixx.kubernetes_worker
      tags: role-kubernetes-worker

After that the role gets deployed on all worker nodes:

ansible-playbook --tags=role-kubernetes-worker k8s.yml

So by now it should already be possible to fetch the state of the worker nodes:

kubectl get nodes -o wide

NAME         STATUS     ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-010102   NotReady   <none>   18m   v1.28.5   10.0.11.3     <none>        Ubuntu 22.04.3 LTS   6.5.0-14-generic   containerd://1.7.12
k8s-010103   NotReady   <none>   30s   v1.28.5   10.0.11.4     <none>        Ubuntu 22.04.3 LTS   6.5.0-14-generic   containerd://1.7.12
k8s-010202   NotReady   <none>   18m   v1.28.5   10.0.11.6     <none>        Ubuntu 22.04.3 LTS   6.5.0-14-generic   containerd://1.7.12
k8s-010203   NotReady   <none>   30s   v1.28.5   10.0.11.7     <none>        Ubuntu 22.04.3 LTS   6.5.0-14-generic   containerd://1.7.12
k8s-010302   NotReady   <none>   18m   v1.28.5   10.0.11.9     <none>        Ubuntu 22.04.3 LTS   6.5.0-14-generic   containerd://1.7.12
k8s-010303   NotReady   <none>   29s   v1.28.5   10.0.11.10    <none>        Ubuntu 22.04.3 LTS   6.5.0-14-generic   containerd://1.7.12

The STATUS column now reports NotReady for all nodes. Looking at the logs on the worker nodes there will be some errors like this:

ansible -m command -a 'journalctl -t kubelet -n 50' k8s_worker

...
May 13 11:40:40 worker01 kubelet[12132]: E0513 11:40:40.646202   12132 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
May 13 11:40:44 worker01 kubelet[12132]: W0513 11:40:44.981728   12132 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
...

This will be fixed next.

What’s missing is the software that lets Pods on different hosts communicate with each other. Previously I used flannel. Flannel is a simple and easy way to configure a layer 3 network fabric designed for Kubernetes. But as time moves on other interesting projects pop up and one of them is Cilium.

Cilium is basically a one-stop solution for everything needed for Kubernetes networking. So there is no need to install additional software, e.g. for Network Policies. Cilium brings API-aware network security filtering to Linux container frameworks like Docker and Kubernetes. Using a new Linux kernel technology called BPF, Cilium provides a simple and efficient way to define and enforce both network-layer and application-layer security policies based on container/pod identity. It really has everything: overlay networking, native routing, IPv4/v6 support, load balancing, direct server return (DSR), Gateway support (a replacement for Ingress), monitoring and troubleshooting, Hubble as an observability platform, network policies, CNI and libnetwork integration, and so on. The use of BPF and XDP also makes it very fast as most of the processing happens in the Linux kernel and not in userspace. The documentation is great too and of course there is also a blog.

Ok, enough Cilium praise 😉 Let’s install it. I prepared an Ansible Cilium role. Download via

ansible-galaxy install githubixx.cilium_kubernetes

The role uses the Cilium Helm chart in the background. So on the Ansible Controller node I need the Helm 3 binary installed. This is also true for some of my other roles coming up. There are at least three ways to install Helm:

  • use your favorite package manager if your distribution includes helm in its repository (for Archlinux use sudo pacman -S helm e.g.)
  • or use one of the Ansible Helm roles (e.g. helm which can be installed via ansible-galaxy role install -vr roles/githubixx.cilium_kubernetes/requirements.yml)
  • or directly download the binary from Helm releases (https://github.com/helm/helm/releases) and put it into /usr/local/bin/ directory e.g.

Also make sure that KUBECONFIG variable is set correctly. But this is something that I already did earlier in my blog posts.
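If you’re not sure whether KUBECONFIG is set correctly, a small check like the following sketch can help. kubeconfig_problems is a hypothetical helper of mine. It only verifies that the variable is set and that every file it lists exists. It doesn’t validate the content of the file.

```python
import os
from pathlib import Path


def kubeconfig_problems(env: dict[str, str]) -> list[str]:
    """Return a list of problems found with the KUBECONFIG setting."""
    value = env.get("KUBECONFIG", "")
    if not value:
        return ["KUBECONFIG is not set"]
    problems = []
    # KUBECONFIG may contain several files separated by the path separator.
    for entry in value.split(os.pathsep):
        if entry and not Path(entry).expanduser().is_file():
            problems.append(f"{entry} does not exist")
    return problems


# Check the environment of the current process.
for problem in kubeconfig_problems(dict(os.environ)):
    print(problem)
```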

The role does a few things on the Kubernetes nodes but most tasks are executed on the Ansible Controller node like installing the Cilium Helm chart, connecting to kube-apiserver to check the status of the Cilium deployment and stuff like that. By default the role “delegates” all tasks that need to connect to the kube-apiserver to 127.0.0.1. This can be changed with cilium_delegate_to variable. I’ll set this variable in group_vars/all.yml. In my case I’ll set it to k8s-01-ansible-ctrl.i.example.com which is actually localhost 😉 But if I need to set some variables for this host I can do so later (see further down below).

I’ll now extend the playbook k8s.yml to specify that the cilium_kubernetes role should be applied to the hosts in the k8s_worker group:

-      
  hosts: k8s_worker
  roles:
    -
      role: githubixx.cilium_kubernetes
      tags: role-cilium-kubernetes

As mentioned above the role delegates quite a few tasks to the Ansible Controller node. This also means that it’ll “delegate” the variables I set for this role. As I defined above that the cilium_kubernetes role should be applied to the k8s_worker hosts group, I need to define the variables for this role in group_vars/k8s_worker.yml. E.g.:

cilium_chart_version: "1.14.5"
cilium_etcd_enabled: "true"
cilium_etcd_interface: "{{ k8s_interface }}"
cilium_etcd_client_port: "2379"
cilium_etcd_nodes_group: "k8s_etcd"
cilium_etcd_secrets_name: "cilium-etcd-secrets"
cilium_etcd_cert_directory: "{{ k8s_ca_conf_directory }}"
cilium_etcd_cafile: "ca-etcd.pem"
cilium_etcd_certfile: "cert-cilium.pem"
cilium_etcd_keyfile: "cert-cilium-key.pem"

If your Kubernetes cluster isn’t that big you can actually remove all cilium_etcd_* variables and just pin the Cilium Helm chart to a specific version by setting cilium_chart_version as above. Without etcd Cilium stores its state in Kubernetes custom resources (CRDs). But since I’m adventurous I’ll run Cilium with an external etcd key-value store that I already use for my kube-apiserver. If you have very strong security requirements and a big cluster it might make sense to have a separate etcd cluster just for Cilium (also see Installation with external etcd).

Regarding the cilium_etcd_* values: etcd is listening on the WireGuard interface only as it’s part of the WireGuard mesh. So cilium_etcd_interface: "wg0" needs to be set, or you can do something like cilium_etcd_interface: "{{ k8s_interface }}" as k8s_interface is already set in group_vars/all.yml and so both values stay in sync. The etcd daemons are listening on port 2379 by default. All etcd hosts are in Ansible’s k8s_etcd hosts group. The role will create a Kubernetes Secret named after the value of cilium_etcd_secrets_name. That Secret will contain the content of the certificate files specified in cilium_etcd_cafile, cilium_etcd_certfile and cilium_etcd_keyfile. Also make sure that cilium_etcd_cert_directory: "{{ k8s_ca_conf_directory }}" is set as all certificate files created with the kubernetes_ca role earlier are stored there and the role needs some of them. The certificate files are needed to allow Cilium to connect to etcd.

Besides the default variables you can also adjust the values for the Helm chart. The default values are in cilium_values_default.yml.j2. But nothing is set in stone 😉 To use your own values just create a file called cilium_values_user.yml.j2 and put it into the templates directory. Then the Cilium role will use that file to render the Helm values. You can use cilium_values_default.yml.j2 as a template or just start from scratch. As mentioned above you can modify all settings for the Cilium Helm chart that are different to the default ones which are located here.

To ensure that the correct Python version and KUBECONFIG variable is used on my Ansible Controller node I’ll set ansible_python_interpreter in host_vars/k8s-01-ansible-ctrl.i.example.com to the python binary in my Python venv. E.g.

ansible_python_interpreter: "/opt/scripts/ansible/k8s-01_vms/bin/python"

And KUBECONFIG will be set in the k8s.yml playbook file:

-
  hosts: k8s_worker
  environment:
    KUBECONFIG: "/opt/scripts/ansible/k8s-01_vms/kubeconfig/admin.kubeconfig"
  roles:
  ...

For further information see the README of the role which also describes all variables. But in general, with the settings above in place, I should end up with a Kubernetes cluster that is already able to run some workload.

If you want to check which Kubernetes resources will be created and with which configuration options, you can do so. E.g.:

ansible-playbook --tags=role-cilium-kubernetes --extra-vars cilium_template_output_directory="/tmp/cilium" k8s.yml

This won’t install the resources but will create a file /tmp/cilium/template.yml on the Ansible Controller node. You can inspect the file to check if you’re fine with all the resources and values.
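The rendered file can get quite long. To get a quick overview of what it contains, a sketch like the following counts the resources per kind. It’s a deliberately simple approach that assumes every resource has its kind: field at the beginning of a line (nested kind: fields are indented and therefore ignored). count_kinds is my own helper and not part of the role.

```python
from collections import Counter


def count_kinds(manifest: str) -> Counter:
    """Count top-level 'kind:' entries in a multi-document YAML manifest."""
    kinds = Counter()
    for line in manifest.splitlines():
        # Only lines starting at column 0 are top-level fields.
        if line.startswith("kind:"):
            kind = line.split(":", 1)[1].strip().strip("\"'")
            kinds[kind] += 1
    return kinds


# Tiny hand-written sample instead of the real template output.
sample = """---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
roleRef:
  kind: ClusterRole
"""

for kind, count in sorted(count_kinds(sample).items()):
    print(kind, count)
```

For the real file the input would be the content of /tmp/cilium/template.yml, e.g. read with pathlib.Path("/tmp/cilium/template.yml").read_text().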

Now Cilium can be installed on the worker nodes:

ansible-playbook --tags=role-cilium-kubernetes -e cilium_action=install k8s.yml

After a while there should be some Cilium pods running:

kubectl --namespace cilium get pods -o wide

NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE         NOMINATED NODE   READINESS GATES
cilium-2kwvz                       1/1     Running   0          81s   10.0.11.10   k8s-010303   <none>           <none>
cilium-57pgx                       1/1     Running   0          81s   10.0.11.7    k8s-010203   <none>           <none>
cilium-jrfz8                       1/1     Running   0          81s   10.0.11.6    k8s-010202   <none>           <none>
cilium-jxjws                       1/1     Running   0          81s   10.0.11.9    k8s-010302   <none>           <none>
cilium-operator-774db8f4cb-b2nzz   1/1     Running   0          81s   10.0.11.7    k8s-010203   <none>           <none>
cilium-operator-774db8f4cb-q54pk   1/1     Running   0          81s   10.0.11.10   k8s-010303   <none>           <none>
cilium-vb26s                       1/1     Running   0          81s   10.0.11.4    k8s-010103   <none>           <none>
cilium-xqk7v                       1/1     Running   0          81s   10.0.11.3    k8s-010102   <none>           <none>

You can also check the logs of the Pods, e.g. with kubectl -n cilium --tail=500 logs cilium-.... Also kubectl get nodes -o wide should now show all nodes as Ready. You might also have noticed that the IP addresses are the WireGuard IPs I’ve assigned to the Kubernetes Controller and Worker nodes.

To resolve Kubernetes cluster internal DNS entries (like *.cluster.local), which are also used for auto-discovery of services, CoreDNS can be used. And that’s also the one I cover here. For this I’ll create a directory called playbooks in my venv. Then I’ll change to that directory and clone ansible-kubernetes-playbooks:

git clone https://github.com/githubixx/ansible-kubernetes-playbooks

Then switch into coredns directory. Basically you can install CoreDNS by just running ansible-playbook coredns.yml. By default this will install a CoreDNS configuration which is defined in configmap.yml.j2. DNS queries to cluster.local zone will be answered by CoreDNS. Every other DNS zone will be forwarded to Cloudflare’s 1.1.1.1 or Quad9’s 9.9.9.9 DNS server. You can change that if you want of course and make further adjustments in the ConfigMap. There is also a second CoreDNS configuration: configmap_quad9_dot.yml.j2. That’s basically the same as the previous one but uses DoT (that’s DNS over TLS). It uses Quad9’s TLS enabled DNS servers. If you want to use that one you need to change templates/configmap.yml.j2 to templates/configmap_quad9_dot.yml.j2 in coredns/tasks/install.yml.

I’ve added a detailed README to the playbook repository. Please have a look there too for further information.

So to finally install CoreDNS use:

ansible-playbook coredns.yml

If you run kubectl --namespace kube-system get pods -o wide afterwards you should see the CoreDNS servers running. E.g.:

NAME                       READY   STATUS    RESTARTS   AGE   IP          NODE         NOMINATED NODE   READINESS GATES
coredns-7fc847d54c-22dvf   1/1     Running   0          17s   10.0.5.30   k8s-010303   <none>           <none>
coredns-7fc847d54c-bkj9r   1/1     Running   0          17s   10.0.0.22   k8s-010103   <none>           <none>

In k8s_worker_kubelet_conf_yaml I defined clusterDNS: "10.32.0.254". That IP is also specified in the CoreDNS Service. You’ll also see it here:

kubectl --namespace kube-system get svc kube-dns -o wide

NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
kube-dns   ClusterIP   10.32.0.254   <none>        53/UDP,53/TCP,9153/TCP   45h   k8s-app=kube-dns

So if you’d like to have a different IP for the CoreDNS Service you now know where to change it.

Now that I’ve installed basically everything needed for running Pods, Deployments, Services, and so on I should be able to do a sample deployment. So on my laptop I’ll run:

kubectl create namespace test
kubectl --namespace test apply -f https://k8s.io/examples/application/deployment.yaml

This will deploy two Pods running nginx. To get an overview of what’s running:

kubectl --namespace test get all -o wide

NAME                                    READY   STATUS    RESTARTS   AGE   IP           NODE         NOMINATED NODE   READINESS GATES
pod/nginx-deployment-86dcfdf4c6-l4hv7   1/1     Running   0          83s   10.0.2.143   k8s-010203   <none>           <none>
pod/nginx-deployment-86dcfdf4c6-rwjl7   1/1     Running   0          83s   10.0.0.41    k8s-010103   <none>           <none>

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES         SELECTOR
deployment.apps/nginx-deployment   2/2     2            2           83s   nginx        nginx:1.14.2   app=nginx

NAME                                          DESIRED   CURRENT   READY   AGE   CONTAINERS   IMAGES         SELECTOR
replicaset.apps/nginx-deployment-86dcfdf4c6   2         2         2       83s   nginx        nginx:1.14.2   app=nginx,pod-template-hash=86dcfdf4c6

Or kubectl --namespace test describe deployment nginx-deployment also does the job.

You should also be able to get the default nginx page on every worker node from one of the two nginx webservers. I use Ansible’s get_url module here and you should see something like this (I truncated the output a bit):

ansible -m get_url -a "url=http://10.0.2.143 dest=/tmp/test.html" k8s_worker

k8s-010302.i.example.com | CHANGED => {
    "changed": true,
    "checksum_dest": null,
    "checksum_src": "7dd71afcfb14e105e80b0c0d7fce370a28a41f0a",
    "dest": "/tmp/test.html",
    "elapsed": 0,
    "gid": 0,
    "group": "root",
    "md5sum": "e3eb0a1df437f3f97a64aca5952c8ea0",
    "mode": "0644",
    "msg": "OK (612 bytes)",
    "owner": "root",
    "size": 612,
    "src": "/home/ansible/.ansible/tmp/ansible-tmp-1705337744.24709-566702-78220146282609/tmp68i1ohfd",
    "state": "file",
    "status_code": 200,
    "uid": 0,
    "url": "http://10.0.2.143"
}
k8s-010103.i.example.com | CHANGED => {
  ...
}

This should give a valid result no matter on which node the page is fetched. Cilium “knows” on which node the Pod with the IP 10.0.2.143 is located and the request gets routed accordingly. If you’re done you can delete the nginx deployment again with kubectl --namespace test delete deployment nginx-deployment (but maybe wait a little bit as the deployment is convenient for further testing…).

You can output the worker internal IPs and the pod CIDR that was assigned to each host with:

kubectl get nodes --output=jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address} {.spec.podCIDR} {"\n"}{end}'

10.0.11.3 10.200.0.0/24 
10.0.11.4 10.200.4.0/24 
10.0.11.6 10.200.2.0/24 
10.0.11.7 10.200.3.0/24 
10.0.11.9 10.200.1.0/24 
10.0.11.10 10.200.5.0/24

The IP addresses 10.0.11.xxx are the addresses I assigned to the WireGuard VPN interface (wg0 in my case) of the worker and controller nodes. That’s important since all communication should travel through the VPN interfaces.
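The pod CIDRs in the output above are all /24 networks carved out of 10.200.0.0/16, which is the value I used for clusterCIDR in the kube-proxy configuration. If you want to verify that for your own cluster, here is a minimal sketch using Python’s ipaddress module. It expects the two-column output of the kubectl command above. outside_cluster_cidr is a name I made up for this example.

```python
import ipaddress

# Same value as "clusterCIDR" in the kube-proxy configuration of this post.
CLUSTER_CIDR = ipaddress.ip_network("10.200.0.0/16")


def outside_cluster_cidr(output: str) -> list[str]:
    """Return node IPs whose pod CIDR is not part of CLUSTER_CIDR."""
    offenders = []
    for line in output.splitlines():
        if not line.strip():
            continue
        node_ip, pod_cidr = line.split()
        if not ipaddress.ip_network(pod_cidr).subnet_of(CLUSTER_CIDR):
            offenders.append(node_ip)
    return offenders


kubectl_output = """\
10.0.11.3 10.200.0.0/24
10.0.11.4 10.200.4.0/24
10.0.11.6 10.200.2.0/24
10.0.11.7 10.200.3.0/24
10.0.11.9 10.200.1.0/24
10.0.11.10 10.200.5.0/24
"""

print(outside_cluster_cidr(kubectl_output))
```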

If you just want to see if the worker nodes are ready use:

kubectl get nodes -o wide

You should now see that STATUS changed from NotReady to Ready.
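If you’d rather check this in a script than by eye, the JSON output of kubectl get nodes -o json contains a Ready condition for every node. The sketch below extracts the names of all nodes that are not Ready. The sample data is hand-written and heavily shortened, and not_ready_nodes is a hypothetical helper.

```python
import json


def not_ready_nodes(nodes_json: str) -> list[str]:
    """Return the names of all nodes without a Ready condition of 'True'."""
    data = json.loads(nodes_json)
    result = []
    for node in data["items"]:
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            condition["type"] == "Ready" and condition["status"] == "True"
            for condition in conditions
        )
        if not ready:
            result.append(node["metadata"]["name"])
    return result


# Shortened sample of what "kubectl get nodes -o json" returns.
sample = json.dumps(
    {
        "items": [
            {
                "metadata": {"name": "k8s-010102"},
                "status": {"conditions": [{"type": "Ready", "status": "True"}]},
            },
            {
                "metadata": {"name": "k8s-010103"},
                "status": {"conditions": [{"type": "Ready", "status": "False"}]},
            },
        ]
    }
)

print(not_ready_nodes(sample))
```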

If you want to test network connectivity, DNS and the like a bit further, you can deploy a kind of debug container, which is just the slim version of the Debian Docker image, e.g.:

kubectl --namespace test run --attach testpod --rm --image=debian:stable-slim --restart=Never -- sh -c "sleep 14400"

It may take a little while until the container image is downloaded. In a second terminal run

kubectl --namespace test exec -it testpod -- bash

After entering the container a few utilities should be installed:

apt-get update && apt-get install iputils-ping iproute2 dnsutils curl telnet

Now it should be possible to resolve the internal IP of kube-apiserver, e.g. (which should be 10.32.0.1 if you kept the default Service IP range setting):

root@testpod:/# dig +short kubernetes.default.svc.cluster.local
10.32.0.1

or

root@testpod:/# dig www.microsoft.com

; <<>> DiG 9.18.19-1~deb12u1-Debian <<>> www.microsoft.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39431
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 7e754940a38fd61a (echoed)
;; QUESTION SECTION:
;www.microsoft.com.             IN      A

;; ANSWER SECTION:
www.microsoft.com.      17      IN      CNAME   www.microsoft.com-c-3.edgekey.net.
www.microsoft.com-c-3.edgekey.net. 17 IN CNAME  www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net.
www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net. 17 IN CNAME e13678.dscb.akamaiedge.net.
e13678.dscb.akamaiedge.net. 17  IN      A       23.35.229.160

;; Query time: 292 msec
;; SERVER: 10.32.0.254#53(10.32.0.254) (UDP)
;; WHEN: Mon Jan 15 18:21:11 UTC 2024
;; MSG SIZE  rcvd: 363

Or resolve the IP address of a Pod (here it’s one of the nginx containers deployed above into the test namespace):

root@testpod:/# dig +short 10-0-2-143.test.pod.cluster.local
10.0.2.143

In both cases the DNS query was resolved by CoreDNS at 10.32.0.254. So resolving external and internal cluster.local DNS queries works as expected. 10.32.0.254 is again a kind of load balancer IP: it’s the cluster IP assigned to the Kubernetes Service called kube-dns, as already mentioned above.
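The Pod DNS name used in the last query follows a simple pattern: the dots of the Pod IP are replaced by dashes, followed by the namespace, pod and the cluster domain. A small sketch of that mapping (pod_dns_name is a hypothetical helper name, and cluster.local as default domain is taken from the example above):

```shell
# Hypothetical helper: build the DNS name of a Pod from its IP address.
# Arguments: <pod-ip> <namespace> [cluster-domain]; the cluster domain
# defaults to cluster.local as used in this post.
pod_dns_name() {
  ip_dashed=$(printf '%s' "$1" | tr '.' '-')
  printf '%s.%s.pod.%s\n' "$ip_dashed" "$2" "${3:-cluster.local}"
}

pod_dns_name 10.0.2.143 test   # prints 10-0-2-143.test.pod.cluster.local
```

Inside the debug container the result can be passed directly to dig +short.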

It should also be possible to fetch the default HTML page from the nginx deployment (output truncated):

curl http://10.0.2.143

<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

...

If you’re done with testing you can delete the created resources (if not done already). E.g.:

kubectl --namespace test delete pod testpod
kubectl --namespace test delete deployments.apps nginx-deployment
kubectl delete namespaces test

There is one final thing to do: As mentioned previously, no “normal” workload should be executed on the Kubernetes Control Plane nodes. This can be achieved with the following task. It adds a so-called Taint to all Control Plane nodes k8s-01[01:03]02 (also see Well-Known Labels, Annotations and Taints). I’ll create a file playbooks/taint_controller.yml with the following content:

---
- name: Taint Kubernetes Controller nodes
  hosts: k8s-01-ansible-ctrl.i.example.com
  gather_facts: true
  tasks:
    - name: Taint Kubernetes control plane nodes
      kubernetes.core.k8s_taint:
        kubeconfig: "/opt/scripts/ansible/k8s-01_vms/kubeconfig/admin.kubeconfig"
        state: present
        name: "{{ hostvars[item]['inventory_hostname_short'] }}"
        taints:
          - effect: NoSchedule
            key: "node-role.kubernetes.io/control-plane"
      with_inventory_hostnames:
        - k8s_controller

The change can be applied with ansible-playbook playbooks/taint_controller.yml. The task will be executed on my Ansible Controller node k8s-01-ansible-ctrl.i.example.com. Once the node-role.kubernetes.io/control-plane:NoSchedule Taint is applied, the Control Plane nodes only accept new Pods that tolerate this Taint. That includes the Cilium pods (they have a Toleration with operator: Exists which basically allows them to run everywhere).
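To double-check that the Taint really landed on the Control Plane nodes, something like the following sketch could be used. The function name control_plane_tainted_nodes is hypothetical. I'm also assuming that kubectl's custom-columns output joins several taint keys with commas and prints <none> for nodes without taints, so please verify that with your kubectl version:

```shell
# Hypothetical helper: reads "NAME TAINT-KEYS" lines on stdin (taint keys
# comma-separated, "<none>" if a node has no taints; this format is an
# assumption) and prints the nodes that carry the control plane taint key.
control_plane_tainted_nodes() {
  awk -v key="node-role.kubernetes.io/control-plane" '
    {
      n = split($2, taints, ",")
      for (i = 1; i <= n; i++) {
        if (taints[i] == key) { print $1; break }
      }
    }'
}

# Intended usage (needs a reachable cluster, hence commented out):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' \
#     | control_plane_tainted_nodes
```

The output should list exactly the Control Plane nodes and none of the workers.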

At this stage the Kubernetes cluster is basically fully functional 😄 But of course there is a lot more that could be done…

Running Sonobuoy could be a good next step. Sonobuoy is a diagnostic tool that makes it easier to understand the state of a Kubernetes cluster by running a set of Kubernetes conformance tests (ensuring CNCF conformance) in an accessible and non-destructive manner. The tests can run quite long (about an hour) but starting them is as quick as this (check if there is a newer version available):

cd /tmp
wget https://github.com/vmware-tanzu/sonobuoy/releases/download/v0.57.1/sonobuoy_0.57.1_linux_amd64.tar.gz
tar xvfz sonobuoy_0.57.1_linux_amd64.tar.gz
export KUBECONFIG=/opt/scripts/ansible/k8s-01_vms/kubeconfig/admin.kubeconfig
./sonobuoy run --wait

After that is done you can inspect the results:

results=$(./sonobuoy retrieve)
./sonobuoy results "$results"
./sonobuoy delete --wait
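Since the Sonobuoy version is hardcoded in the download URL further up, a tiny helper can make switching to a newer release less error-prone. This is only a sketch: sonobuoy_url is a hypothetical function name, and it assumes that other releases keep the same file naming scheme as v0.57.1:

```shell
# Hypothetical helper: build the release tarball URL for a given Sonobuoy
# version, following the naming pattern of the v0.57.1 URL used in this post.
# Arguments: <version-without-v> [os] [arch]; defaults to linux/amd64.
sonobuoy_url() {
  version=$1
  os=${2:-linux}
  arch=${3:-amd64}
  printf 'https://github.com/vmware-tanzu/sonobuoy/releases/download/v%s/sonobuoy_%s_%s_%s.tar.gz\n' \
    "$version" "$version" "$os" "$arch"
}

sonobuoy_url 0.57.1
```

The printed URL can then be handed to wget as before.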

You may also want to have a look at Velero. It’s a utility for managing disaster recovery, specifically for your Kubernetes cluster resources and persistent volumes.

You may also want to have some monitoring, e.g. by using Prometheus + Alertmanager and creating some nice dashboards with Grafana. Having a nice Kubernetes dashboard like Lens might also be helpful.

Having centralized logs from containers and the Kubernetes nodes is also very useful. For this Loki and again Grafana might be an option, but there are also various “logging stacks” like ELK (Elasticsearch, Logstash and Kibana) out there that could make life easier.

But I’ll do something completely different first 😉 Up until now nobody from the outside can access any service that runs on the Kubernetes cluster. For this something called Ingress is needed. So let’s continue with Kubernetes the Not So Hard Way With Ansible - Ingress with Traefik v2 and cert-manager (Part 1). In that blog post I’ll install the Traefik ingress controller and cert-manager to automatically fetch and renew TLS certificates from Let’s Encrypt.