
I am running a batch job in my EKS cluster that trains an ML model, and the training runs for 8-10 hours. However, it seems like the node on which the job runs is killed and the job is restarted on a new node. I am monitoring the node in Prometheus, and there was no CPU or OOM issue.

My next bet was to look into the EKS control plane logs, and right when the node is removed I see the events below:

  • kube-controller-manager log
controller_utils.go:179] Recording status change NodeNotReady event message for node XXX
controller_utils.go:121] Update ready status of pods on node [XXX]
event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"XXX", UID:"1bf33ec8-41cd-434a-8579-3ba4b8cdd5f1", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node XXX status is now: NodeNotReady
node_lifecycle_controller.go:917] Node XXX is unresponsive as of 2021-06-09 01:00:48.962450508 +0000 UTC m=+5151508.967069635. Adding it to the Taint queue.
node_lifecycle_controller.go:180] deleting node since it is no longer present in cloud provider: XXX
  • kube-scheduler log
node_tree.go:113] Removed node "XXX" in group "us-east-2:\x00:us-east-2b" from NodeTree

I checked the kubelet logs, but they do not have any message about moving the node to NotReady status. I was expecting to at least see this message in the kubelet log - https://github.com/kubernetes/kubernetes/blob/e9de1b0221dd8687aba527e682fafc7c33370c09/pkg/kubelet/kubelet_node_status.go#L682

This makes me wonder whether the kubelet died, the node became unreachable, or the connection between the kube-apiserver and the kubelet on that node was lost.

I have been working for days to debug this issue, but with no success.

Note: the batch job does eventually run successfully after the restart. Also, this issue is sporadic, i.e. sometimes the restart happens and sometimes it does not and the job finishes in the first run.

Aks
  • Do you use an ASG? Or what might be the reason behind the deleted node? – Matt Jun 10 '21 at 07:50
  • We do not use an ASG. That is what I am wondering; I do not know the reason behind the deleted node, which makes me think that AWS must be deleting the instance or something. – Aks Jun 10 '21 at 11:54
  • Did you check the cloudwatch logs for who/what is deleting the node/ec2 instance? – Matt Jun 10 '21 at 12:28
  • Which service logs should I check? i.e Who would be deleting the node/ec2 instance? – Aks Jun 10 '21 at 13:05
  • Also, from the above logs I am wondering whether the node is getting marked as NotReady and then the cluster autoscaler is bringing down the node. So it may be that the autoscaler is deleting the node. – Aks Jun 10 '21 at 13:37
  • Oh, so you are using cluster autoscaler? Yes, this may be it. Check out this: https://stackoverflow.com/questions/63871413/how-to-make-sure-kubernetes-autoscaler-not-deleting-the-nodes-which-runs-specifi (a minimal sketch of that approach follows these comments) – Matt Jun 10 '21 at 13:47
  • Thanks for pointing to that. I will take a look. However, this may prevent the node where the batch job runs from being deleted, but how can I know what exactly is causing the node to be `NotReady`? – Aks Jun 10 '21 at 13:53
  • Check the cluster autoscaler logs. Maybe there is some info that it deleted the node. – Matt Jun 10 '21 at 13:54
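
Following up on the comment thread above: if the cluster autoscaler turns out to be the component removing the node, a common mitigation is the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` pod annotation, which tells the autoscaler not to scale down a node while that pod is running. A minimal sketch, assuming a plain batch/v1 Job; all names, images, and resource values below are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training            # placeholder name
spec:
  backoffLimit: 2                 # retries before the Job is marked failed
  template:
    metadata:
      annotations:
        # Ask cluster-autoscaler not to scale down the node hosting this pod.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never        # let the Job controller handle retries
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest   # placeholder image
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
```

Note that this only guards against autoscaler-initiated scale-down; it does not help if the EC2 instance itself is terminated (for example by a spot interruption or a failed health check). The cluster autoscaler's own logs should confirm whether it was the one removing the node.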

1 Answer


Are you using spot instance nodes? That might be one of the reasons the node gets terminated, since spot capacity can be reclaimed when the spot/bid price changes. Try dedicated (on-demand) instances.
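
If the node group does mix spot and on-demand capacity, one way to keep the training pod off spot nodes is a nodeSelector on the capacity-type label that EKS managed node groups apply. A sketch, assuming a managed node group (self-managed nodes may carry a different label):

```yaml
# In the Job's pod template: schedule only onto on-demand nodes.
spec:
  template:
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: ON_DEMAND
```

On managed node groups the `eks.amazonaws.com/capacityType` label takes the values ON_DEMAND or SPOT, so the selector above keeps the pod away from spot capacity entirely.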

sai kumar