Context
We have long running kubernetes jobs based on docker containers. The containers needs resources (eg 15gb memory, 2 cpu) and we use autoscaler to scale up new worker nodes at request.
Scenario
Users can select the version of the docker image to be used for a job, eg 1.0.0, 1.1.0, or even a commit hash of the code the image was build from in test environment.
As we leave the docker tag to be freetext, the user can type a non-existing docker tag. Because of this the job pod comes in ImagePullBackOff state. The pod stays in this state and keeps the resources locked so that they cannot be reused by any other job.
Question
What is the right solution, that can be applied in kubernetes itself, for failing the pod immediately or at least quickly if a pull fails due to a non existing docker image:tag?
Possibilities
I looked into backofflimit. I have set it to 0, but this doesn't fail or remove the job. The resources are of course kept as well.
Maybe they can be killed by a cron job. Not sure how to do so.
Ideally, resources should not even be allocated for a job with an unexisting docker image. But I'm not sure if there is a possibility to easily achieve this.
Any other?