
I am using AWS EKS with a managed node group. Twice in the past couple of weeks I have had a case where the kubelet on one of the nodes crashed or stopped reporting back to the control plane.

In this case I would expect the Auto Scaling group to identify the node as unhealthy and replace it. However, this is not what happens. I have recreated the issue by creating a node and manually stopping the kubelet; see the image below:

enter image description here

My first thought was to create an Event Bus alert that would trigger a Lambda function to take care of this, but I couldn't find the EKS service in the list of services in Event Bus, so …

Does anyone know of a tool or configuration that would help with this? To be clear I am looking for something that would:

  1. Detect that the kubelet isn't connecting to the control plane
  2. Delete the node from the cluster
  3. Terminate the EC2 instance

THANKS!!

yammering
  • Have you got the solution? – thinkingmonster Jul 21 '22 at 16:27
  • I don't think the Auto Scaling group is aware that the node is unhealthy, because it only cares about the node's metrics. The control plane is the one that has this information, and it should be talking to other components, such as the cluster autoscaler, to create and destroy nodes. In your case I have some suggestions: enable autoscaling first to ensure your application's availability, then SSH into the failing node to debug. Check your networking, VPC, and CIDR; check whether your cluster has a third-party CNI such as Cilium, or something else that is misconfigured; and check roles and permissions. – Brody Oct 05 '22 at 04:31
  • You can implement the above mechanism, but it is still an ad hoc way of fixing a problem whose cause we don't even know. If you can, please post the logs here so the community can help, and let us know what you find. This is going to be great diagnostic work for you. – Brody Oct 05 '22 at 04:36
  • We've moved to using Karpenter for autoscaling. It's been a bit finicky, but it works well once configured. – yammering Oct 27 '22 at 07:52

1 Answer


I would suggest looking at the node-problem-detector or this blog by Cloudflare. There is an issue on the EKS roadmap for automated node health checking. I would upvote the issue if it's important to you.
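Until something like that is in place, the three steps from the question (detect, delete the node, terminate the instance) can be approximated with a small script run on a schedule. What follows is only a rough sketch, not a tested tool. It assumes `kubectl` and the `aws` CLI are installed and authenticated wherever it runs, that `spec.providerID` on each node ends with the EC2 instance ID (as it normally does for EC2-backed EKS nodes), and that a 5-minute grace period is acceptable. The function names and the `GRACE` value are my own choices for illustration.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

# Assumption: how long a node may be not-Ready before we act on it.
GRACE = timedelta(minutes=5)


def parse_time(ts):
    # Kubernetes timestamps are UTC, e.g. "2022-10-05T04:31:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_stale(node, now, grace=GRACE):
    """True if the node's Ready condition has been False/Unknown for longer than grace."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready":
            if cond["status"] == "True":
                return False
            return now - parse_time(cond["lastTransitionTime"]) > grace
    return False  # no Ready condition reported: leave the node alone


def instance_id(node):
    # Assumption: providerID looks like "aws:///us-east-1a/i-0123456789abcdef0".
    return node["spec"]["providerID"].rsplit("/", 1)[-1]


def replace_unhealthy(dry_run=True):
    # Step 1: detect nodes whose kubelet has stopped reporting.
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    now = datetime.now(timezone.utc)
    for node in json.loads(out)["items"]:
        if not is_stale(node, now):
            continue
        name = node["metadata"]["name"]
        iid = instance_id(node)
        print(f"unhealthy: {name} ({iid})")
        if dry_run:
            continue
        # Step 2: delete the node object from the cluster.
        subprocess.run(["kubectl", "delete", "node", name], check=True)
        # Step 3: terminate the instance; the Auto Scaling group should then launch a replacement.
        subprocess.run(
            ["aws", "ec2", "terminate-instances", "--instance-ids", iid],
            check=True,
        )
```

Calling `replace_unhealthy()` only prints the nodes it would act on; pass `dry_run=False` once the output looks right. This treats the symptom only, so it is still worth finding out why the kubelet stops in the first place.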

Jeremy Cowan