EKS node moves to NodeNotReady state when running a batch job

Question

I am running a batch job in my EKS cluster that trains an ML model; the training runs for 8-10 hours. However, it seems that the node on which the job runs is killed and the job is restarted on a new node. I am monitoring the node in Prometheus, and there appears to have been no CPU or OOM issue.

My next step was to look into the EKS CloudTrail logs, and right when the node is removed I see the events below:

kube-controller-manager log

controller_utils.go:179] Recording status change NodeNotReady event message for node XXX
controller_utils.go:121] Update ready status of pods on node [XXX]
event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"XXX", UID:"1bf33ec8-41cd-434a-8579-3ba4b8cdd5f1", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node XXX status is now: NodeNotReady
node_lifecycle_controller.go:917] Node XXX is unresponsive as of 2021-06-09 01:00:48.962450508 +0000 UTC m=+5151508.967069635. Adding it to the Taint queue.
I0609 01:00:48.962465 1 node_lifecycle_controller.go:917] Node XXX is unresponsive as of 2021-06-09 01:00:48.962450508 +0000 UTC m=+5151508.967069635. Adding it to the Taint queue.
node_lifecycle_controller.go:180] deleting node since it is no longer present in cloud provider: XXX

kube-scheduler log

node_tree.go:113] Removed node "XXX" in group "us-east-2:\x00:us-east-2b" from NodeTree

I checked the kubelet logs, but they do not contain any message about moving the node to NotReady status. I was expecting to at least see this message in the kubelet log: https://github.com/kubernetes/kubernetes/blob/e9de1b0221dd8687aba527e682fafc7c33370c09/pkg/kubelet/kubelet_node_status.go#L682

This makes me wonder whether the kubelet dies, the node becomes unreachable, or the connection from the kube-apiserver to the kubelet on that node is lost.

I have been debugging this issue for days, with no success.

Note: The batch job running in Kubernetes does eventually run successfully after a restart. Also, this issue is sporadic, i.e. sometimes the restart happens and sometimes it does not and the job finishes on the first run.

Do you use an ASG? If not, what might be the reason behind the deleted node? — Matt, Jun 10 '21 at 07:50
We do not use an ASG. That is what I am wondering about: I do not know the reason behind the deleted node, which makes me think that AWS must be deleting the instance or something. — Aks, Jun 10 '21 at 11:54
Did you check the CloudWatch logs for who/what is deleting the node/EC2 instance? — Matt, Jun 10 '21 at 12:28
Which service's logs should I check? That is, who would be deleting the node/EC2 instance? — Aks, Jun 10 '21 at 13:05
Also, I suspect from the above logs that the node is getting marked as NotReady and then the cluster autoscaler is bringing down the node. So it may be that the autoscaler is deleting the node. — Aks, Jun 10 '21 at 13:37
Oh, so you are using the cluster autoscaler? Yes, that may be it. Check this out: https://stackoverflow.com/questions/63871413/how-to-make-sure-kubernetes-autoscaler-not-deleting-the-nodes-which-runs-specifi — Matt, Jun 10 '21 at 13:47
Thanks for pointing to that. I will take a look. However, this may prevent the node where the batch job runs from being deleted, but how can I find out what exactly is causing the node to become `NotReady`? — Aks, Jun 10 '21 at 13:53
Check the cluster autoscaler logs. Maybe there is some info showing that it deleted the node. — Matt, Jun 10 '21 at 13:54
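
For readers following the autoscaler suggestion in the comments above, here is a minimal sketch (not taken from the thread) of one documented way to stop the Cluster Autoscaler from evicting a long-running pod during scale-down: the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` pod annotation described in the Cluster Autoscaler FAQ. The job name, container name, image, and `backoffLimit` value are placeholders, not values from the question.

```python
import json

# Pod annotation described in the Cluster Autoscaler FAQ; "false" asks the
# autoscaler not to remove a node while this pod is running on it.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def build_training_job(name, image):
    """Return a batch/v1 Job manifest whose pod opts out of autoscaler eviction."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # placeholder retry budget
            "template": {
                # The annotation goes on the pod template, not on the Job itself.
                "metadata": {"annotations": {SAFE_TO_EVICT: "false"}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "trainer", "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, used only to render an example manifest.
    manifest = build_training_job("ml-training", "example.com/trainer:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. Note that this only affects autoscaler scale-down; it would not explain why the node went `NotReady`, and it does not protect against the EC2 instance itself being terminated.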

score 0 · Answer 1 · answered Aug 04 '21 at 16:36

Are you using spot instance nodes? That might be one of the reasons the node gets terminated, since spot instances can be reclaimed when the spot price or available capacity changes. Try on-demand instances instead.

sai kumar
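
To check the spot-instance hypothesis, a small sketch like the following could list the nodes that report spot capacity. It assumes the nodes belong to EKS managed node groups, which set the `eks.amazonaws.com/capacityType` label to `SPOT` or `ON_DEMAND`; self-managed nodes may not carry this label, so an empty result is not conclusive. The function names and the sample node names are illustrative only.

```python
import json
import subprocess

# Label set by EKS managed node groups (assumption: the cluster uses them).
CAPACITY_LABEL = "eks.amazonaws.com/capacityType"


def spot_nodes(node_list):
    """Return names of nodes in a `kubectl get nodes -o json` payload labeled SPOT."""
    names = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels") or {}
        if labels.get(CAPACITY_LABEL) == "SPOT":
            names.append(metadata.get("name", ""))
    return names


def fetch_nodes():
    """Fetch the node list from the current kubectl context (requires kubectl)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Offline demo with made-up nodes; call fetch_nodes() against a real cluster.
    sample = {
        "items": [
            {"metadata": {"name": "node-a", "labels": {CAPACITY_LABEL: "SPOT"}}},
            {"metadata": {"name": "node-b", "labels": {CAPACITY_LABEL: "ON_DEMAND"}}},
        ]
    }
    print(spot_nodes(sample))
```

Against a live cluster, `spot_nodes(fetch_nodes())` would return the spot-backed node names. If the node that ran the training job appears there, a spot interruption is a plausible cause of the "no longer present in cloud provider" message in the controller-manager log.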
