Kubernetes upgrade notes: 1.16.x to 1.17.x

If you used my Kubernetes the Not So Hard Way With Ansible blog posts to set up a Kubernetes (K8s) cluster, these notes might be helpful for you (and maybe also for others who manage a K8s cluster on their own).

I have a general upgrade guide Kubernetes the Not So Hard Way With Ansible - Upgrading Kubernetes that has worked quite well for me for the last few K8s upgrades. So please read that guide if you want to know HOW the components are updated. This post is specifically about the 1.16.x to 1.17.x upgrade and WHAT I changed.

First: As usual I don’t update a production system before the .2 release of a new major version is out. In my experience the .0 and .1 releases are just too buggy. Of course it is still important to test new releases early on development or integration systems and report bugs!

Second: I only upgrade from the latest version of the former major release. In my case I was running 1.16.3, and at the time of writing this text 1.16.8 was the latest 1.16.x release. After reading the 1.16.x changelog to see if any important changes were made between 1.16.3 and 1.16.8, I didn’t see anything that prevented me from updating, and I didn’t need to change anything. So I did the 1.16.3 to 1.16.8 upgrade first. If you use my Ansible roles, that basically only means changing the k8s_release variable from 1.16.3 to 1.16.8 and rolling the change out to the control plane and worker nodes as described in my upgrade guide. After that everything still worked as expected, so I continued with the next step.
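For reference, a minimal sketch of what that looks like - the variable file, playbook name and tags below are just examples and may differ from your setup:

```bash
# In your group_vars (or wherever you keep your variables) bump the release:
#   k8s_release: "1.16.8"
#
# Then roll out the control plane first and the worker nodes afterwards, e.g.
# (playbook name and tags are placeholders - use the ones from your setup):
ansible-playbook --tags=role-kubernetes-controller k8s.yml
ansible-playbook --tags=role-kubernetes-worker k8s.yml
```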

Here are two links that might be interesting regarding the new features in Kubernetes 1.17:
What’s New in Kubernetes 1.17: A Deeper Look at New Features
What’s New in Kubernetes 1.17

Since K8s 1.14 there are also searchable release notes available. You can specify the K8s version and a K8s area/component (e.g. kubelet, apiserver, …) and immediately get an overview of what changed in that area. Quite nice! :-)

As it is normally no problem to have a kubectl utility that is one minor version ahead of the server version, I also updated kubectl from 1.16.x to 1.17.4 using my kubectl Ansible role.
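To double-check the client/server version skew afterwards you can run:

```bash
# Shows the client (kubectl) and server (kube-apiserver) versions.
kubectl version --short
```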

One important external dependency on my list was etcd. It changed from version 3.3.13 to 3.4.3. As this is a quite critical component of the whole K8s setup, great care should be taken. So I first read the etcd 3.3 to 3.4 upgrade guide. This document says:

In the general case, upgrading from etcd 3.3 to 3.4 can be a zero-downtime, rolling upgrade ...

That’s at least a good start ;-) Reading further, upgrading to 3.4 should be possible without additional changes. For more information on how to upgrade etcd, read Kubernetes the Not So Hard Way With Ansible - Upgrading Kubernetes if you used my etcd Ansible role. That blog post also contains a hint on how to back up your etcd cluster before upgrading (which is a good idea in general ;-) )
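A backup can be as simple as taking a snapshot on one of the etcd nodes - the endpoint and certificate paths below are just assumptions, adjust them to your setup:

```bash
# Take a snapshot of the current etcd data (v3 API):
ETCDCTL_API=3 etcdctl snapshot save /var/tmp/etcd-backup-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca-etcd.pem \
  --cert=/etc/etcd/cert-etcd.pem \
  --key=/etc/etcd/cert-etcd-key.pem
```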

There is one important thing you need to make sure of if you upgrade to etcd 3.4.x and use flannel: etcd 3.4.x disables the v2 API by default (see https://github.com/etcd-io/etcd/blob/master/Documentation/upgrades/upgrade_3_4.md#make-etcd---enable-v2false-default). But flannel needs the v2 API. So you need to enable it with --enable-v2=true. Otherwise you’ll see error messages like Subnet watch failed: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint. in the systemd journal (also see https://github.com/coreos/flannel/issues/1191).
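In other words, the etcd service needs that flag added. A simplified excerpt of what the systemd unit could look like (all other flags omitted):

```
# Simplified excerpt - keep all the flags you already have and just add:
ExecStart=/usr/local/bin/etcd \
  --enable-v2=true \
  ...
```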

In CHANGELOG-1.17 there is an important note if you have nodes that use CSI raw block volumes:

Storage: A node that uses a CSI raw block volume needs to be drained before kubelet can be upgraded to 1.17

The reason for this is this pull request: Separate staging/publish and unstaging/unpublish logics for block. So do not forget to drain a worker node before upgrading it, e.g. kubectl drain workerXX --ignore-daemonsets --force --grace-period=30 --timeout=30s (of course replace workerXX with the name of your worker node - use kubectl get nodes to get the worker node names). Do not forget to kubectl uncordon workerXX after you updated your worker node to allow pods to be scheduled on that node again. Since you evicted the pods from that node anyway, it’s maybe also a good time to update all operating system packages, the kernel, …
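Put together, the per-node workflow looks roughly like this (workerXX is a placeholder as above):

```bash
# Evict the pods and mark the node as unschedulable:
kubectl drain workerXX --ignore-daemonsets --force --grace-period=30 --timeout=30s

# ... upgrade kubelet & co. on that node (and maybe OS packages/kernel) ...

# Allow pods to be scheduled on the node again:
kubectl uncordon workerXX
```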

Reading further in CHANGELOG-1.17 we have this:

Deprecate the default service IP CIDR. The previous default was 10.0.0.0/24 which will be removed in 6 months/2 releases. Cluster admins must specify their own desired value, by using --service-cluster-ip-range on kube-apiserver.

The --service-cluster-ip-range flag has already been set to 10.32.0.0/16 in my Ansible controller playbook for quite a while. If you haven’t set it yet, now is the time.
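A simplified excerpt of the kube-apiserver systemd unit with the flag set explicitly - the CIDR is the one I use, adjust it to your network layout:

```
# Simplified excerpt - keep all the flags you already have and just make sure
# the service CIDR is set explicitly:
ExecStart=/usr/local/bin/kube-apiserver \
  --service-cluster-ip-range=10.32.0.0/16 \
  ...
```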

Next:

- All resources within the rbac.authorization.k8s.io/v1alpha1 and rbac.authorization.k8s.io/v1beta1 API groups are deprecated in favor of rbac.authorization.k8s.io/v1, and will no longer be served in v1.20. 

I adjusted that accordingly in my Ansible controller playbook.
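If you maintain your own RBAC manifests, this basically just means switching the apiVersion to the v1 API group. A purely illustrative example (the binding name and subject are made up):

```yaml
apiVersion: rbac.authorization.k8s.io/v1   # instead of .../v1alpha1 or .../v1beta1
kind: ClusterRoleBinding
metadata:
  name: example-view-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    name: example
    namespace: default
```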

One possibly interesting feature is Graduate ScheduleDaemonSetPods to GA. That’s explained in detail at Scheduled by default scheduler. In short, ScheduleDaemonSetPods allows you to schedule DaemonSets using the default scheduler instead of the DaemonSet controller, by adding a NodeAffinity term to the DaemonSet pods instead of the .spec.nodeName field. The default scheduler is then used to bind the pod to the target host. Without ScheduleDaemonSetPods, the DaemonSet controller makes scheduling decisions without considering pod priority and preemption (when preemption is enabled).
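The node affinity term that gets injected into the DaemonSet pods looks roughly like this (based on the DaemonSet documentation; target-host-name is a placeholder for the node the pod is meant for):

```yaml
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchFields:
          - key: metadata.name
            operator: In
            values:
              - target-host-name
```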

If you use CSI then also check the CSI Sidecar Containers documentation. For every sidecar container there is a matrix that tells you the minimum, maximum and recommended version for a given K8s version. Since this is quite new stuff, basically all CSI sidecar containers work with K8s 1.13 to 1.17. The first releases of these sidecar containers only need K8s 1.10, but I wouldn’t use such old versions. So there is at least no urgent need to upgrade the CSI sidecar containers ATM. Nevertheless, if your K8s update to v1.17 worked fine I would recommend also updating the CSI sidecar containers sooner or later, because a) lots of changes happen in this area ATM and b) you might require the newer versions for the next K8s release anyway.
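To see which sidecar versions you are currently running, you can list the container images of your CSI driver pods - kube-system is just an example namespace, your CSI driver might live somewhere else:

```bash
kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{" "}{end}{"\n"}{end}'
```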

Now I finally updated the K8s controller and worker nodes to version 1.17.4 as described in Kubernetes the Not So Hard Way With Ansible - Upgrading Kubernetes.

If you see errors like

Apr 05 18:58:30 controller03 kube-controller-manager[3375]: E0405 18:58:30.109867    3375 leaderelection.go:331] error retrieving resource lock kube-system/kube-controller-manager: leases.coordination.k8s.io "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system"

during upgrading the controller nodes, this seems to be okay. The error should go away once all controller nodes are running the new Kubernetes version (also see https://github.com/gardener/gardener/issues/1879).
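As the log message suggests, the 1.17 kube-controller-manager tries to use a Lease object in kube-system for leader election. Once all controller nodes run 1.17.x, you can check that the lease exists and is held by one of your (upgraded) controller nodes:

```bash
kubectl -n kube-system get lease kube-controller-manager -o yaml
```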

That’s it for today! Happy upgrading! ;-)