I am using AWS EKS with a managed node group. Twice in the past couple of weeks, the kubelet on one of the nodes has crashed or stopped reporting back to the control plane.
In this case I would expect the Auto Scaling group to identify the node as unhealthy and replace it. However, that is not what happens. I have reproduced the issue by creating a node and manually stopping the kubelet, see image below:
My first thought was to create an Event Bus alert that would trigger a Lambda to take care of this, but I couldn't find the EKS service in the list of services in Event Bus, so …
Does anyone know of a tool or configuration that would help with this? To be clear, I am looking for something that would:
- Detect that the kubelet isn't connecting to the control plane
- Delete the node from the cluster
- Terminate the EC2 instance
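
To make the three steps concrete, here is a rough sketch of the kind of thing I have in mind, not a tested solution. It assumes `kubectl` and the `aws` CLI are installed and authenticated wherever it runs, that a stopped kubelet shows up as a `Ready` condition that is no longer `"True"`, and that the node's `spec.providerID` ends with the EC2 instance ID. The five-minute grace period and all function names are my own placeholders.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

# Assumption: how long a node may stay not-Ready before it is replaced.
GRACE = timedelta(minutes=5)


def ready_condition(node):
    """Return the node's Ready condition dict, or None if it has none."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def is_stale(node, now, grace=GRACE):
    """True if Ready has been False/Unknown for longer than the grace period."""
    cond = ready_condition(node)
    if cond is None or cond.get("status") == "True":
        return False
    since = datetime.strptime(
        cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now - since > grace


def instance_id(node):
    """Extract the EC2 instance ID from a providerID such as
    'aws:///eu-west-1a/i-0123456789abcdef0'; None if it does not look like one."""
    provider_id = node.get("spec", {}).get("providerID", "")
    last = provider_id.rsplit("/", 1)[-1]
    if provider_id.startswith("aws://") and last.startswith("i-"):
        return last
    return None


def replace_node(node):
    """Delete the node object, then terminate the backing EC2 instance."""
    name = node["metadata"]["name"]
    iid = instance_id(node)
    if iid is None:
        return  # not an EC2-backed node, leave it alone
    subprocess.run(["kubectl", "delete", "node", name], check=True)
    subprocess.run(
        ["aws", "ec2", "terminate-instances", "--instance-ids", iid], check=True
    )


def main():
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    now = datetime.now(timezone.utc)
    for node in json.loads(out)["items"]:
        if is_stale(node, now):
            replace_node(node)


if __name__ == "__main__":
    main()
```

Run on a schedule (a CronJob or similar), this would cover detection, node deletion and instance termination. Once the instance is terminated, the node group's Auto Scaling group should launch a replacement to get back to its desired capacity.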
THANKS!!