
I have a Kubernetes cluster with 1 master and 2 slave nodes. When a node goes down, it takes Kubernetes approximately 5 minutes to notice the failure. I am using dynamic provisioning for volumes, and this delay is a bit too long for me. How can I reduce the failure detection time? I found a post about it: https://fatalfailure.wordpress.com/2016/06/10/improving-kubernetes-reliability-quicker-detection-of-a-node-down/

At the bottom of the post, it says we can reduce the detection time by changing these parameters:

kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)

I can change the node-status-update-frequency parameter on the kubelet, but I don't have any controller manager program or command on the CLI. How can I change those parameters? Any other suggestions for reducing the detection time would be appreciated.
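For context, a minimal sketch of the kubelet change on a kubeadm-managed node (this assumes the kubelet service picks up KUBELET_EXTRA_ARGS from /etc/default/kubelet; on RHEL-based systems the file is /etc/sysconfig/kubelet, and the path may differ on other setups):

```sh
# Sketch: set the kubelet flag via KUBELET_EXTRA_ARGS (kubeadm convention).
# /etc/default/kubelet is an assumption; adjust the path for your distribution.
echo 'KUBELET_EXTRA_ARGS=--node-status-update-frequency=4s' | sudo tee /etc/default/kubelet

# Reload systemd and restart the kubelet so the new flag takes effect.
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```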

Adi Soyadi

2 Answers


> ...but I don't have any controller manager program or command on the CLI. How can I change those parameters?

You can change/add those parameters in the controller-manager systemd unit file and restart the daemon. Please check the reference documentation for the controller-manager here.
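For example, a sketch of what the unit file change could look like (the unit path, binary location, and kubeconfig flag shown here are assumptions; keep your existing flags and adapt the paths to your install):

```ini
# /etc/systemd/system/kube-controller-manager.service (hypothetical path, excerpt)
# Append the new flags to the existing ExecStart line, keeping the flags already there.
[Service]
ExecStart=/usr/local/bin/kube-controller-manager \
  --kubeconfig=/etc/kubernetes/controller-manager.conf \
  --node-monitor-period=2s \
  --node-monitor-grace-period=16s \
  --pod-eviction-timeout=30s
```

After editing, run `sudo systemctl daemon-reload` and restart the kube-controller-manager service.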

If you deploy the controller-manager as a microservice (pod), check the manifest file for that pod and change the parameters in the container's command section (for example like this), as sketched below.
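As an illustration, on a kubeadm cluster the static pod manifest lives at /etc/kubernetes/manifests/kube-controller-manager.yaml (mentioned in the comments below), and the flags from the question go into the command list. A trimmed sketch, with the image version illustrative and your existing flags kept as they are:

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: k8s.gcr.io/kube-controller-manager:v1.14.1  # illustrative version
    command:
    - kube-controller-manager
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --node-monitor-period=2s           # from 5s
    - --node-monitor-grace-period=16s    # from 40s
    - --pod-eviction-timeout=30s         # from 5m
```

Because this is a static pod, the kubelet watches the manifests directory and recreates the pod when the file changes; there is no need to kubectl apply it.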

Veerendra K
  • There is a manifest file that could be relevant: /etc/kubernetes/manifests/kube-controller-manager.yaml. Can I add those flags and apply the manifest with kubectl apply -f kube-controller-manager.yaml? Would that work? – Adi Soyadi Apr 22 '19 at 11:17
  • Yes, you can modify that manifest. You will probably need to restart the kubelet after that. – Vasili Angapov Apr 22 '19 at 11:23
  • Unfortunately, the manifest change results in a CrashLoopBackOff. I also tried /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, but it has no effect. When I run the describe command it shows nothing except: Back-off restarting failed container – Adi Soyadi Apr 22 '19 at 16:38
  • @AdiSoyadi, what is it saying? I don't remember exactly, but can you check how those pods are deployed, i.e. as a `replicaset` or `daemonset` in the `kube-system` namespace? Then open the manifest file for the `replicaset`/`daemonset` and edit it. – Veerendra K Apr 23 '19 at 06:52

It's actually kube-controller-manager. You may also decrease --attach-detach-reconcile-sync-period from 1m to 15 or 30 seconds for kube-controller-manager. This allows for speedier volume attach/detach actions. How you change those parameters depends on how you set up the cluster.
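For instance, on a kubeadm cluster the flag would be added to the same command list in the static pod manifest (a sketch; 15s is just an example value):

```yaml
# Excerpt from /etc/kubernetes/manifests/kube-controller-manager.yaml
    command:
    - kube-controller-manager
    - --attach-detach-reconcile-sync-period=15s  # default is 1m
```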

Vasili Angapov
  • Thanks for the reply. My actual problem is that I cannot find any documentation about kube-controller-manager, and I don't know how to set it up or use it. My cluster: 2 slaves x 1 master, on-premise (VirtualBox). – Adi Soyadi Apr 22 '19 at 11:20
  • Hi, could you please edit /etc/kubernetes/manifests/kube-controller-manager.yaml and add the necessary flags as described by community member @Veerendra [here](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/)? – Mark Apr 30 '19 at 13:35