
I have ICP V2.1 installed in a RHEL VMware image. After rebooting the image, ICP fails to start with what appears to be the first known issue in the documentation ("Kubernetes controller manager fails to start after a master or cluster restart"). However, the prescribed resolution does not get my system going.

Here is the running pod list:

    NAME                                             READY     STATUS             RESTARTS   AGE
    calico-node-amd64-dtl47                          2/2       Running            14         20h
    filebeat-ds-amd64-mvcsj                          1/1       Running            8          20h
    k8s-etcd-192.168.232.131                         1/1       Running            7          20h
    k8s-mariadb-192.168.232.131                      1/1       Running            7          20h
    k8s-master-192.168.232.131                       2/3       CrashLoopBackOff   15         17m
    k8s-proxy-192.168.232.131                        1/1       Running            7          20h
    metering-reader-amd64-gkwt4                      1/1       Running            7          20h
    monitoring-prometheus-nodeexporter-amd64-sghrv   1/1       Running            7          20h

Removing the k8s-master-192.168.232.131 pod and allowing it to restart only puts it back into the CrashLoopBackOff state. Here is the last line of the controller manager log:

    F1029 23:55:07.345341 1 controllermanager.go:176] error building controller context: failed to get supported resources from server: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1alpha1: an error on the server ("Error: 'dial tcp 10.0.0.145:443: getsockopt: connection refused'\nTrying to reach: 'https://10.0.0.145:443/apis/servicecatalog.k8s.io/v1alpha1'") has prevented the request from succeeding

Removing the pod or removing the failed controller manager Docker container directly has no effect. It seems like another service hasn't started yet, or has failed to start. I've waited several hours to see if the issue resolves itself, but to no avail.
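
For reference, this is roughly what I ran; the kube-system namespace is my assumption about where ICP keeps its management pods:

    # delete the failing pod and let kubelet recreate it
    kubectl -n kube-system delete pod k8s-master-192.168.232.131

    # or remove the dead controller manager container directly
    docker ps -a | grep controller-manager
    docker rm <container-id>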

Thanks...

robo

1 Answer


Before the fix of https://github.com/kubernetes/kubernetes/pull/49495, the Kubernetes controller manager failed to start if a registered extension-apiserver was not ready. In ICP, the service catalog is implemented as an extension-apiserver.
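
To see which extension apiservers are registered and whether they are ready, you can inspect the apiservices; the Available condition in the describe output explains why an entry is not ready (use the -s form described below if your token has expired):

    # list all registered apiservices
    kubectl get apiservices
    # show the Available condition for the service catalog entry
    kubectl describe apiservice v1alpha1.servicecatalog.k8s.io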

Usually, after the ICP master is restarted, kubelet starts the k8s management services first as static pods. After that, it gets pod/node/service information from the Kubernetes API server and then starts all the pods, including the catalog API service. In that case, the whole cluster recovers.

In your case, however, there is a race condition: when kubelet got the pod information from the Kubernetes API server and started all the pods, it had not yet received the node information from the API server. As a result, kubelet failed to start the catalog API service because its nodeSelector was not met, and the whole cluster failed to recover.
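
To confirm this is what happened, you can compare the labels on your node with the nodeSelector on the catalog API server pod. The pod name below is a placeholder, and the kube-system namespace is an assumption:

    # labels currently registered on the node(s)
    kubectl get nodes --show-labels
    # nodeSelector required by the catalog apiserver pod (placeholder name)
    kubectl -n kube-system get pod <catalog-apiserver-pod> -o jsonpath='{.spec.nodeSelector}'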

In the next release, ICP 2.1.0.1, Kubernetes will be upgraded to 1.8.2, which includes the fix from https://github.com/kubernetes/kubernetes/pull/49495. That will resolve the issue completely.

Until then, you can try the following workaround.

Use the -s flag form of the kubectl commands below if your token has expired after the restart and you no longer have access to the GUI to re-establish it.

  1. Delete the apiservice v1alpha1.servicecatalog.k8s.io:

    kubectl delete apiservices v1alpha1.servicecatalog.k8s.io

    kubectl -s 127.0.0.1:8888 delete apiservices v1alpha1.servicecatalog.k8s.io

  2. Delete the dead controller manager container:
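
    # if you do not know the container name, something like this can locate it
    # (the grep pattern is illustrative):
    docker ps -a | grep controller-manager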

    docker rm <k8s controller manager>

  3. Wait until the service catalog has started:
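
    # one way to check; the kube-system namespace and the "catalog" name pattern
    # are assumptions about how ICP deploys the catalog apiserver
    # (add -s 127.0.0.1:8888 if your token has expired):
    kubectl -n kube-system get pods | grep catalog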

  4. Recover the service catalog by re-registering the apiservice v1alpha1.servicecatalog.k8s.io:

    kubectl apply -f cluster/cfc-components/service-catalog/apiregistration.yaml

    kubectl -s 127.0.0.1:8888 apply -f cluster/cfc-components/service-catalog/apiregistration.yaml
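
Afterwards, you can verify that the apiservice is available again and that the controller manager stays up; the kube-system namespace is an assumption (add -s 127.0.0.1:8888 if needed):

    kubectl get apiservices v1alpha1.servicecatalog.k8s.io

    kubectl -n kube-system get pods | grep k8s-master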

Sanjay.Joshi
Yong Feng
  • Just emailed you the log. Interestingly, ps -a showed that the last time it ended was 41 hours ago, and it hasn't been restarted since, over multiple reboots. – robo Oct 31 '17 at 19:57