
I have a problem with controller-manager and scheduler not responding that does not appear to be related to the GitHub issues I've found (rancher#11496, azure#173, …).

Two days ago one pod on one node of our 3-node HA cluster caused a memory overflow. After that the Rancher web app was not accessible; we found the offending pod and scaled it to 0 via kubectl, but figuring everything out took some time.
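
For reference, the scale-down was done with a command along these lines (the deployment name and namespace here are placeholders):

kubectl scale deployment <offending-deployment> --replicas=0 --namespace <namespace>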

Since then the Rancher web app has been working properly, but there are continuous alerts about controller-manager and scheduler not working. The alerts are not consistent: sometimes both components report healthy, sometimes their health check URLs refuse the connection.

NAME                 STATUS      MESSAGE                                                                                     ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Healthy     ok                                                                                     
etcd-0               Healthy     {"health": "true"}                                                                     
etcd-2               Healthy     {"health": "true"}                                                                     
etcd-1               Healthy     {"health": "true"}
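
The table above is the output of `kubectl get componentstatuses` (short form: `kubectl get cs`), so the current state can be re-checked at any time with:

kubectl get componentstatuses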

Restarting controller-manager and scheduler on the affected node hasn't been effective. Even restarting all of the components with

docker restart kube-apiserver kubelet kube-controller-manager kube-scheduler kube-proxy

wasn't effective either.
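
To confirm the control-plane containers actually came back up after such a restart, something like this can help:

docker ps --filter "name=kube-" --format "{{.Names}}: {{.Status}}"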

Can someone please help me figure out the steps to troubleshoot and fix this issue without downtime for the running containers?

Nodes are hosted on DigitalOcean on servers with 4 cores and 8 GB of RAM each (Ubuntu 16, Docker 17.03.3).

Thanks in advance!

ralic
  • Share the logs from the controller pod; that will help – P Ekambaram Feb 22 '19 at 14:12
  • Thanks for the comment! Can you please help me with that? Not knowing how to find the controller pod is one of my problems. `kubectl get pods --namespace kube-system` does not list `controller-manager` or `scheduler` – ralic Feb 22 '19 at 14:17
  • Actually, `kubectl get pods --all-namespaces` doesn't seem to list anything "controller-manager"-like, or I really do not know what I am looking for... – ralic Feb 22 '19 at 14:26
  • It is located in the kube-system namespace – P Ekambaram Feb 22 '19 at 14:29
  • Ok, maybe that's the problem. `kubectl get pods --namespace kube-system` returns these pods: `canal-XXXXX` (x3), `cert-manager-XXXXX`, `kube-dns-XXXXX`, `kube-dns-autoscaler-XXXXX`, `metrics-server-XXXXX`, `rke-ingress-controller-deploy-job-XXXXX`, `rke-kubedns-addon-deploy-job-XXXXX`, `rke-metrics-addon-deploy-job-XXXXX`, `rke-network-plugin-deploy-job-XXXXX`, `tiller-deploy-XXXXX`. Does this make any sense to you? – ralic Feb 22 '19 at 14:38
  • do you see all of those pods running? what do you see in logs from controller pod – P Ekambaram Feb 22 '19 at 14:40
  • Every pod is Running except the `rke-xxx-deploy-job-xxx` pods, which are Completed, and I assume that's how it should be, because they are deploy jobs. Can you please point out which of these is the "controller pod", because maybe that's my biggest confusion right now. What command do I need to run to see logs from the controller pod? – ralic Feb 22 '19 at 14:46
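
(Note: in a Rancher/RKE-provisioned cluster the core control-plane components typically run as plain Docker containers on the nodes rather than as pods in kube-system, which would explain why they don't show up in the list above. Assuming that layout, the controller manager logs can be pulled directly from Docker on a control-plane node:)

docker logs --tail 100 kube-controller-manager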

2 Answers


The first place to look would be your logs. Can you export the following log and attach it?

/var/log/kube-controller-manager.log

The controller manager records its leader election state in an Endpoints object, so you will need to do a "get endpoints". Can you run the following:

kubectl -n kube-system get endpoints kube-controller-manager

and

kubectl -n kube-system describe endpoints kube-controller-manager

and

kubectl -n kube-system get endpoints kube-controller-manager -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
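
In the output of the last command, the leader annotation should show a current `holderIdentity` and a recent `renewTime`; a stale `renewTime` would suggest that leader election is failing.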
Soroush
  • I have the same question, and the log is empty. Output of the first two commands: `kubectl -n kube-system get endpoints kube-controller-manager` returns `NAME ENDPOINTS AGE` / `kube-controller-manager       5d19h`; `kubectl -n kube-system describe endpoints kube-controller-manager` returns `Name: kube-controller-manager`, `Namespace: kube-system`, `Labels:`, `Annotations: control-plane.alpha.kubernetes.io/leader: {"holderIdentity":"master_cdd7e148..8d6","leaseDur":15,"acqTime":"2020-11-02","renewTime"...}`, `Subsets:`, `Events:` – Charbel Nov 03 '20 at 19:37
  • `kubectl -n kube-system get endpoints kube-controller-manager -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'` returns `{"holderIdentity":"master_cdd7e148-64cb-4d07-8ec9-1858309988d6","leaseDurationSeconds":15,"acquireTime":"2020-11-02T22:46:50Z","renewTime":"2020-11-03T19:38:30Z","leaderTransitions":6}` – Charbel Nov 03 '20 at 19:39

Please run these commands on the master nodes; they comment out the `- --port=0` flag in the static pod manifests, which re-enables the insecure health check ports:

sed -i 's|- --port=0|#- --port=0|' /etc/kubernetes/manifests/kube-scheduler.yaml
sed -i 's|- --port=0|#- --port=0|' /etc/kubernetes/manifests/kube-controller-manager.yaml

systemctl restart kubelet

After the kubelet restarts, it re-creates the static pods from the modified manifests and the health checks should pass again.
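
To verify, the health endpoints can be checked directly (assuming the default insecure ports: 10251 for the scheduler, 10252 for the controller manager):

curl http://127.0.0.1:10252/healthz
curl http://127.0.0.1:10251/healthz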

  • Best answer, it works well. It may be related to this [issue](https://www.claudiokuenzler.com/blog/1049/rancher2-kubernetes-cluster-errors-alerts-controller-manager-scheduler-deep-dive) – lupaulus Sep 04 '21 at 09:43