I'm running AWS EKS on Fargate and using Kubernetes to orchestrate multiple cron jobs. Roughly 1000 pods spin up and down over the course of a day.
Very rarely (once every 3 weeks), one of the pods gets stuck in ContainerCreating and just hangs there. Because I have concurrency disabled, that particular job will never run. The fix is simply to terminate the job or the pod and let it restart, but that requires manual intervention.
Is there a way to get a pod to terminate or restart automatically if it takes too long to create?
The reason the pod gets stuck varies quite a bit, so a solution would need to be general. It can be a time-based solution: all the pods run the same code with different configurations, so the startup time is relatively consistent.
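To make the time-based idea concrete, here is a rough sketch of the kind of check I have in mind, not something I'm running. It assumes the input is the parsed output of `kubectl get pods -o json`, and the 10-minute threshold and the function names are placeholders I made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: anything comfortably above the normal startup time.
MAX_CREATE_TIME = timedelta(minutes=10)


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as 2024-01-01T12:00:00Z into UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_stuck_creating(pod, now, max_age=MAX_CREATE_TIME):
    """Return True if the pod is older than max_age and still ContainerCreating."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return False
    created = parse_k8s_time(pod["metadata"]["creationTimestamp"])
    if now - created <= max_age:
        return False
    # ContainerCreating shows up as the waiting reason of a container status.
    for container_status in status.get("containerStatuses", []):
        waiting = container_status.get("state", {}).get("waiting", {})
        if waiting.get("reason") == "ContainerCreating":
            return True
    return False


def stuck_pod_names(pod_list, now=None):
    """Return the names of all pods in a pod list that look stuck."""
    now = now or datetime.now(timezone.utc)
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if is_stuck_creating(pod, now)
    ]
```

The returned names could then be passed to `kubectl delete pod` by something that runs on a schedule, but that is one more moving part to maintain, so I would prefer a mechanism built into Kubernetes if one exists.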