I have a production cluster currently running on K8s version 1.19.9, where the kube-scheduler and kube-controller-manager fail to hold their leader elections. A leader is able to acquire the first lease, but it then cannot renew or reacquire it. This leaves the other candidates stuck in a constant loop of re-electing a leader: none of them stays leader long enough to do anything meaningful before timing out, another candidate takes a new lease, and the cycle repeats from node to node. Here are the logs, followed by a note on the lock objects and election timings involved:
E1201 22:15:54.818902 1 request.go:1001] Unexpected error when reading response body: context deadline exceeded
E1201 22:15:54.819079 1 leaderelection.go:361] Failed to update lock: resource name may not be empty
I1201 22:15:54.819137 1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1201 22:15:54.819176 1 controllermanager.go:293] leaderelection lost
Detailed Docker logs:
Flag --port has been deprecated, see --secure-port instead.
I1201 22:14:10.374271 1 serving.go:331] Generated self-signed cert in-memory
I1201 22:14:10.735495 1 controllermanager.go:175] Version: v1.19.9+vmware.1
I1201 22:14:10.736289 1 dynamic_cafile_content.go:167] Starting request-header::/etc/kubernetes/pki/front-proxy-ca.crt
I1201 22:14:10.736302 1 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
I1201 22:14:10.736684 1 secure_serving.go:197] Serving securely on 0.0.0.0:10257
I1201 22:14:10.736747 1 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-controller-manager...
I1201 22:14:10.736868 1 tlsconfig.go:240] Starting DynamicServingCertificateController
E1201 22:14:20.737137 1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:14:32.803658 1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:14:44.842075 1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:15:13.386932 1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: context deadline exceeded
I1201 22:15:44.818571 1 leaderelection.go:253] successfully acquired lease kube-system/kube-controller-manager
I1201 22:15:44.818755 1 event.go:291] "Event occurred" object="kube-system/kube-controller-manager" kind="Endpoints" apiVersion="v1" type="Normal" reason="LeaderElection" message="master001_1d360610-1111-xxxx-aaaa-9999 became leader"
I1201 22:15:44.818790 1 event.go:291] "Event occurred" object="kube-system/kube-controller-manager" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="master001_1d360610-1111-xxxx-aaaa-9999 became leader"
E1201 22:15:54.818902 1 request.go:1001] Unexpected error when reading response body: context deadline exceeded
E1201 22:15:54.819079 1 leaderelection.go:361] Failed to update lock: resource name may not be empty
I1201 22:15:54.819137 1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1201 22:15:54.819176 1 controllermanager.go:293] leaderelection lost
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc00000e001, 0xc000fb20d0, 0x4c, 0xc6)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x6a57fa0, 0xc000000003, 0x0, 0x0, 0xc000472070, 0x68d5705, 0x14, 0x125, 0x0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:945 +0x191
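For reference, these are the lock objects and election timings in play. The kubectl commands are only a sketch; the object names come from the events in the logs above, and the 10s "?timeout=10s" on the failed requests appears to match the default renew deadline. The manifest path assumes a kubeadm-style layout.
# Both lock objects show up in the events above (the Endpoints object plus the
# coordination.k8s.io/v1 Lease), i.e. the hybrid "endpointsleases" lock is in use.
kubectl -n kube-system get endpoints kube-controller-manager -o yaml
kubectl -n kube-system get lease kube-controller-manager -o yaml

# Election timings on kube-controller-manager/kube-scheduler (defaults:
# --leader-elect-lease-duration=15s, --leader-elect-renew-deadline=10s,
# --leader-elect-retry-period=2s).
grep -- '--leader-elect' /etc/kubernetes/manifests/kube-controller-manager.yaml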
My duct-tape recovery method was to shut down the other candidates and disable leader election with --leader-elect=false. We manually set a single leader, let it stay on for a while, and then re-enabled leader election. This seems to have restored normal behaviour, and the leases are renewing normally again. A rough sketch of the procedure is below.
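Roughly, the procedure looked like this (a sketch only; the paths assume our kubeadm-style static pod layout under /etc/kubernetes/manifests/):
# On the control-plane nodes we did NOT want as leader: stop the candidates by
# moving their static pod manifests out of the kubelet's manifest directory.
mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/

# On the one node kept as leader: run without an election at all
# (the kubelet picks up the manifest change and restarts the pod).
sed -i 's/--leader-elect=true/--leader-elect=false/' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i 's/--leader-elect=true/--leader-elect=false/' /etc/kubernetes/manifests/kube-scheduler.yaml

# After the components had been stable for a while, we reverted the flag to
# --leader-elect=true and moved the other manifests back.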
Could it be that the kube-apiserver is too overwhelmed to serve these requests in time, and that is why the elections fail with context-deadline timeouts? I was wondering if anyone has encountered an issue like this.
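These are the kinds of checks I was planning to run to confirm whether the control plane itself is the bottleneck (again just a sketch; the apiserver pod name is guessed from the "master001" node name in the logs, and etcdctl certificate flags are omitted):
# Aggregated health checks exposed by the apiserver.
kubectl get --raw='/readyz?verbose'

# Request latency as seen by the apiserver; slow GET/PUT on endpoints or
# leases in kube-system would line up with the renew timeouts above.
kubectl get --raw='/metrics' | grep '^apiserver_request_duration_seconds' | grep -E 'endpoints|leases' | head

# Apiserver logs for etcd/timeout complaints (pod name assumed from the node name).
kubectl -n kube-system logs kube-apiserver-master001 --tail=200 | grep -iE 'timeout|etcd'

# etcd itself: slow disk fsync is a common cause of exactly this failure mode.
etcdctl endpoint status -w table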