Kubernetes the not so hard way with Ansible - etcd cluster - (K8s v1.27)

2023-09-10

  • update to etcd v3.5.8 for Kubernetes v1.27
  • update etcd_settings

2022-01-13

  • update to etcd v3.5.1 for Kubernetes v1.22
  • remove log-package-levels setting from etcd_settings as etcd 3.5 does not like empty values for this parameter. So if you need this parameter just add it to etcd_settings_user with a sensible value. Otherwise etcd wont start.

2021-09-12

  • no changes for Kubernetes v1.21

2020-07-05

  • no changes for Kubernetes v1.20

2020-11-03

  • update to etcd v3.4.14
  • etcd_data_dir permissions changed to 0700. Before the permissions were not set so in most cases that ended up with 0755. This was needed because of CHANGELOG-3.4

2020-08-06

  • no version etcd version change for Kubernetes v1.18
  • changed some default values for etcd_settings. (cert|key)-file and peer-(cert|key)-file now uses different certificates:
"cert-file": "{{etcd_conf_dir}}/cert-etcd-server.pem"
"key-file": "{{etcd_conf_dir}}/cert-etcd-server-key.pem"
"peer-cert-file": "{{etcd_conf_dir}}/cert-etcd-peer.pem"
"peer-key-file": "{{etcd_conf_dir}}/cert-etcd-peer-key.pem"

Therefore etcd_certificates list was also adjusted accordingly and the Ansible command to check cluster health.

2020-04-05

  • upgrade to etcd v3.4.7 (latest major version supported/recommended for Kubernetes v1.17)
  • enable v2 API again like in etcd v3.3.x
  • rename deprecated log-output flag to log-outputs
  • remove --cors flag
  • set flag --log-outputs=systemd/journal and add flag --logger=zap

2019-11-14

  • upgrade to etcd v3.3.15 (latest major version supported/recommended for Kubernetes v1.16)

2019-09-12

  • no changes needed for Kubernetes v1.15

2019-05-20

  • upgrade to etcd v3.3.13 (latest major version supported/recommended for Kubernetes v1.14)

2019-01-14

  • etcd v3.2.24 is still the latest version supported/recommended for Kubernetes v1.13

2018-12-09

  • upgrade to etcd v3.2.24 (latest version supported/recommended for Kubernetes v1.12)

2018-09-30

  • upgrade to etcd v3.2.18 (latest version supported/recommended for Kubernetes v1.11)
  • introduced etcd_ca_conf_directory variable to get more independent from other Kubernetes roles. In our case etcd_ca_conf_directory just points to the value of k8s_ca_conf_directory variable.

This post is based on Kelsey Hightower’s Kubernetes The Hard Way - Bootstrapping the etcd cluster.

In the previous part certificate authority we installed our PKI (public key infrastructure) in order to secure communication between our Kubernetes components/infrastructure. Now we use the certificate authorities (CA) and generated keys for the first and very important component - the etcd cluster. etcd is basically a distributed key/value database. The Kubernetes components are stateless. All state is stored in etcd so you should take care of your etcd cluster in production. If you loose all etcd nodes you loose the whole Kubernetes state… So making a snapshot/backup from time to time is - at least - recommended 😉

I want to mention that if your etcd nodes won’t join then a possible reason could be the certificate. If it isn’t your firewall blocking traffic between your etcd nodes the certificate’s host list could be the problem. The error message isn’t always clear about the issue.

As usual we add the role ansible-role-etcd to the k8s.yml file e.g.:

  hosts: k8s_etcd
  roles:
    -
      role: githubixx.etcd
      tags: role-etcd

Next install the role via

ansible-galaxy install githubixx.etcd

(or just clone the Github repo whatever you like). Basically you don’t need to change a lot of variables but you can if you want of course:

# The directory from where to copy the etcd certificates. By default this
# will expand to user's LOCAL $HOME (the user that run's "ansible-playbook ..."
# plus "/etcd-certificates". That means if the user's $HOME directory is e.g.
# "/home/da_user" then "etcd_ca_conf_directory" will have a value of
# "/home/da_user/etcd-certificates".
etcd_ca_conf_directory: "{{ '~/etcd-certificates' | expanduser }}"

# etcd Ansible group
etcd_ansible_group: "k8s_etcd"
# etcd version
etcd_version: "3.5.8"
# Port where etcd listening for clients
etcd_client_port: "2379"
# Port where etcd is listening for it's peer's
etcd_peer_port: "2380"
# Interface to bind etcd ports to
etcd_interface: "wg0"
# Directory for etcd configuration
etcd_conf_dir: "/etc/etcd"
# Directory to store downloaded etcd archive
# Should not be deleted to avoid downloading over and over again
etcd_download_dir: "/opt/etcd"
# Directory to store etcd binaries
etcd_bin_dir: "/usr/local/bin"
# etcd data directory (etcd database files so to say)
etcd_data_dir: "/var/lib/etcd"
# Architecture to download and install
etcd_architecture: "amd64"
# Only change this if the architecture you are using is unsupported (for example: arm64)
# For more information, see this: https://github.com/etcd-io/website/blob/master/content/docs/v3.4/op-guide/supported-platform.md
etcd_allow_unsupported_archs: false
# By default etcd tarball gets downloaded from official
# etcd repository. This can be changed to some custom
# URL if needed. For more information which protocols
# can be used see:
# https://docs.ansible.com/ansible/latest/collections/ansible/builtin/get_url_module.html
# It's only important to keep the filename naming schema:
# "etcd-v{{ etcd_version }}-linux-{{ etcd_architecture }}.tar.gz"
etcd_download_url: "https://github.com/etcd-io/etcd/releases/download/v{{ etcd_version }}/etcd-v{{ etcd_version }}-linux-{{ etcd_architecture }}.tar.gz"
# By default the SHA256SUMS file is used to verify the
# checksum of the tarball archive. This can also be
# changed to your needs.
etcd_download_url_checksum: "sha256:https://github.com/coreos/etcd/releases/download/v{{ etcd_version }}/SHA256SUMS"

etcd_settings:
  "name": "{{ ansible_hostname }}"
  "cert-file": "{{ etcd_conf_dir }}/cert-etcd-server.pem"
  "key-file": "{{ etcd_conf_dir }}/cert-etcd-server-key.pem"
  "trusted-ca-file": "{{ etcd_conf_dir }}/ca-etcd.pem"
  "peer-cert-file": "{{ etcd_conf_dir }}/cert-etcd-peer.pem"
  "peer-key-file": "{{ etcd_conf_dir }}/cert-etcd-peer-key.pem"
  "peer-trusted-ca-file": "{{ etcd_conf_dir }}/ca-etcd.pem"
  "advertise-client-urls": "{{ 'https://' + hostvars[inventory_hostname]['ansible_' + etcd_interface].ipv4.address + ':' + etcd_client_port }}"
  "initial-advertise-peer-urls": "{{ 'https://' + hostvars[inventory_hostname]['ansible_' + etcd_interface].ipv4.address + ':' + etcd_peer_port }}"
  "listen-peer-urls": "{{ 'https://' + hostvars[inventory_hostname]['ansible_' + etcd_interface].ipv4.address + ':' + etcd_peer_port }}"
  "listen-client-urls": "{{ 'https://' + hostvars[inventory_hostname]['ansible_' + etcd_interface].ipv4.address + ':' + etcd_client_port + ',https://127.0.0.1:' + etcd_client_port }}"
  "peer-client-cert-auth": "true"            # Enable peer client cert authentication
  "client-cert-auth": "true"                 # Enable client cert authentication
  "initial-cluster-token": "etcd-cluster-0"  # Initial cluster token for the etcd cluster during bootstrap.
  "initial-cluster-state": "new"             # Initial cluster state ('new' or 'existing')
  "data-dir": "{{ etcd_data_dir }}"          # etcd data directory (etcd database files so to say)
  "wal-dir": ""                              # Dedicated wal directory ("" means no separated WAL directory)
  "auto-compaction-retention": "0"           # Auto compaction retention in hour. 0 means disable auto compaction.
  "snapshot-count": "100000"                 # Number of committed transactions to trigger a snapshot to disk
  "heartbeat-interval": "100"                # Time (in milliseconds) of a heartbeat interval
  "election-timeout": "1000"                 # Time (in milliseconds) for an election to timeout. See tuning documentation for details
  "max-snapshots": "5"                       # Maximum number of snapshot files to retain (0 is unlimited)
  "max-wals": "5"                            # Maximum number of wal files to retain (0 is unlimited)
  "quota-backend-bytes": "0"                 # Raise alarms when backend size exceeds the given quota (0 defaults to low space quota)
  "logger": "zap"                            # Specify ‘zap’ for structured logging or ‘capnslog’.
  "log-outputs": "systemd/journal"           # Specify 'stdout' or 'stderr' to skip journald logging even when running under systemd
  "enable-v2": "true"                        # enable v2 API to stay compatible with previous etcd 3.3.x (needed for flannel e.g.)
  "discovery-srv": ""                        # Discovery domain to enable DNS SRV discovery, leave empty to disable. If set, will override initial-cluster.

# Certificate authority and certificate files for etcd
etcd_certificates:
  - ca-etcd.pem               # certificate authority file
  - ca-etcd-key.pem           # certificate authority key file
  - cert-etcd-peer.pem        # peer TLS cert file
  - cert-etcd-peer-key.pem    # peer TLS key file
  - cert-etcd-server.pem      # server TLS cert file
  - cert-etcd-server-key.pem  # server TLS key file

Make sure to change etcd_interface to wg0 instead of tap0 if you followed my blog series so far and used Wireguard VPN and the interface wg0. That’s important as the etcd cluster nodes and the Kubernetes API server of course must be able to talk to each other!

The etcd default flags/settings defined in etcd_settings can be overridden by defining a variable called etcd_settings_user. You can also add additional settings by using this variable. E.g. to override the default value for log-outputs setting add a new setting like grpc-keepalive-min-time add the following settings to group_vars/all.yml e.g.:

etcd_settings_user:
  "log-outputs": "stdout"
  "grpc-keepalive-min-time": "10s"

The role will search for the certificates we created in certificate authority in the directory you specify in etcd_ca_conf_directory (which in our case is just the value of k8s_ca_conf_directory) on the host you run Ansible (so you can basically set etcd_ca_conf_directory: "{{ k8s_ca_conf_directory }}"). The files used here are listed in etcd_certificates.

We can deploy the role now via

ansible-playbook --tags=role-etcd k8s.yml

This will install the etcd cluster and start the etcd daemons. Have a look at the logs of your etcd hosts if everything worked and the etcd nodes are connected. Use journalctl --no-pager or journalctl -f or journalctl -t etcd to check the systemd log.

Afterwards we can use Ansible to check the cluster status e.g.:

ansible -m shell -e "etcd_conf_dir=/etc/etcd" -a 'ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://{{ ansible_wg0.ipv4.address }}:2379 \
--cacert={{ etcd_conf_dir }}/ca-etcd.pem \
--cert={{ etcd_conf_dir }}/cert-etcd-server.pem \
--key={{ etcd_conf_dir }}/cert-etcd-server-key.pem' \
k8s_etcd

I use Ansible’s shell module here. I also set a variable etcd_conf_dir which points to the directory where the etcd certificate files are located. That should be the same value as the value of etcd_conf_dir variable of the etcd role. Since my etcd processes listen on the Wireguard interface I use ansible_wg0.ipv4.address here as wg0 is the name of my Wireguard interface. If you use a different port than 2379 then of course you need to change that one too. You should see now a output similar to this:

etcd-node1 | CHANGED | rc=0 >>
https://10.8.0.101:2379 is healthy: successfully committed proposal: took = 2.807665ms
etcd-node2 | CHANGED | rc=0 >>
https://10.8.0.103:2379 is healthy: successfully committed proposal: took = 2.682864ms
etcd-node3 | CHANGED | rc=0 >>
https://10.8.0.102:2379 is healthy: successfully committed proposal: took = 10.169332ms

Next we’ll install the Kubernetes control plane.