
I'm running AWS EKS on Fargate and using Kubernetes to orchestrate multiple cron jobs. I spin roughly 1000 pods up and down over the course of a day.

Very rarely (about once every 3 weeks), one of the pods gets stuck in ContainerCreating and just hangs there. Because I have concurrency disabled, that particular job will never run again until the stuck pod is cleared. The fix is simply terminating the job or the pod and letting it restart, but that is a manual intervention.

Is there a way to get a pod to terminate or restart, if it takes too long to create?

The reason for the pod getting stuck varies quite a bit. A solution would need to be general. It can be a time based solution as all the pods are running the same code with different configurations so the startup time is relatively consistent.

Damian Jacobs

1 Answer


Sadly, there is no mechanism to stop a job if it fails at image pulling or container creation. I also tried to do what you are trying to achieve.

You can set a `backoffLimit` inside your template, but it only counts retries of pods that actually ran and failed. It won't cover a pod that is stuck in ContainerCreating.

What you can do is write a script that describes each pod in the namespace, parses the output, and restarts any pod that is stuck in ContainerCreating.
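A minimal sketch of such a script in Python, assuming `kubectl` is on the PATH and already configured for the cluster. The function names, the `cron-jobs` namespace and the 10-minute threshold are illustrative and not taken from the question. It reads `kubectl get pods -o json` rather than parsing `describe` output, because the JSON exposes the same waiting reason in a structured form.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

# Waiting reason reported for containers that have not been created yet.
STUCK_REASON = "ContainerCreating"


def find_stuck_pods(pod_list, max_age, now=None):
    """Return names of pods waiting in ContainerCreating for longer than max_age.

    pod_list is the parsed output of `kubectl get pods -o json`.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pod_list.get("items", []):
        # creationTimestamp is RFC 3339 in UTC, e.g. "2022-07-04T12:07:00Z".
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        statuses = pod.get("status", {}).get("containerStatuses", [])
        waiting = any(
            s.get("state", {}).get("waiting", {}).get("reason") == STUCK_REASON
            for s in statuses
        )
        if waiting and now - created > max_age:
            stuck.append(pod["metadata"]["name"])
    return stuck


def delete_stuck_pods(namespace, max_age):
    """Delete every stuck pod in the namespace so its Job can create a new one."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    for name in find_stuck_pods(json.loads(result.stdout), max_age):
        subprocess.run(
            ["kubectl", "delete", "pod", name, "-n", namespace], check=True
        )
```

Something like `delete_stuck_pods("cron-jobs", timedelta(minutes=10))` could then run on a schedule. Because startup time is said to be consistent across the pods, a single age threshold set comfortably above the normal startup time should be enough.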

Alternatively, try to debug or trace what is causing this. Run `kubectl describe pods` while the pod is in ContainerCreating to get more information.

BeGreen
  • Thank you. I feared that creating a secondary process to check the state and age of pods would be the only option. The chance of failure is about 0.005%, and the reason for failing varies. I think the last one was that one of the temporary mounted volumes wasn't created before the pod, so the mount failed when the pod tried it. This is the first time I've seen this particular error. – Damian Jacobs Jul 04 '22 at 12:07
  • If you describe your job as YAML, you can read `status` -> `startTime`. Example here: https://stackoverflow.com/questions/48934491/kubernetes-how-to-delete-pods-based-on-age-creation-time – BeGreen Jul 04 '22 at 12:19