2

For some context, I'm creating an API in Python that creates K8s Jobs with user input passed in as environment variables.

Sometimes the selected image does not exist or has been deleted, a Secret is missing, or a Volume hasn't been created. This puts the Job's Pod into a CrashLoopBackOff or ImagePullBackOff state.

First, I'm wondering whether the requested resources are still allocated to the Job while it is in this state.

If so, I don't want the Job to loop forever and lock resources for a Job that will never start.

I've set backoffLimit to 0, but that only applies when the Job detects a failed Pod and tries to launch another Pod to retry. In my case, I know that when a Pod of a Job fails, it's mostly due to OOM or to code that fails because of the user input, so retrying will always fail.

But it doesn't limit the number of retries while a Pod is in CrashLoopBackOff or ImagePullBackOff. Is there a way to make the Job terminate or fail in that case? I don't want to delete it, just free the resources and keep the events in (status.container.state.waiting.reason + status.container.state.waiting.message) or (status.container.state.terminated.reason + status.container.state.terminated.exit_code).
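
For reference, here is a minimal sketch of how those fields can be read with the Python Kubernetes client (the namespace and the job-name label selector below are placeholders for my actual values):

from kubernetes import client, config

# Load credentials; inside the cluster, config.load_incluster_config() would be used instead.
config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder namespace and Job name; Jobs label their Pods with "job-name".
pods = v1.list_namespaced_pod("default", label_selector="job-name=my-job")

for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        if cs.state.waiting:
            print(cs.state.waiting.reason, cs.state.waiting.message)
        elif cs.state.terminated:
            print(cs.state.terminated.reason, cs.state.terminated.exit_code)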

Is there an option I can set at creation time to limit the number of retries, so that resources are freed, but without removing the Job, so the logs are kept?

BeGreen

2 Answers

0

I have tested your first question and yes: even if a pod is in the CrashLoopBackOff state, the resources are still allocated to it! Here is my test: Are the Kubernetes requested resources by a pod still allocated to it when it is in crashLoopBackOff state?

Thanks for your question!

Bguess
  • Thanks a lot for answering this. Do you know by any chance how to break the crash loop after X attempts? – BeGreen May 06 '22 at 18:53
  • In my opinion there is no such parameter in Kubernetes for the moment... You could maybe write a script with enough authorization to check whether a pod is in CrashLoopBackOff and delete its ownerReference object (for example, a pod owned by a ReplicaSet that is owned by a Deployment, or a pod owned by a Job directly). Here is a command to check the ownerReference of pods with custom-columns: kubectl get po -o custom-columns=NAME:".metadata.name",OWNER:".metadata.ownerReferences[0].name",OWNER_KIND:".metadata.ownerReferences[0].kind" – Bguess May 06 '22 at 22:38
  • Some people talk about using operators like https://github.com/flant/shell-operator but isn't that overkill? I mean, if you have a CrashLoopBackOff you have to fix the issue so it doesn't happen again, right? Still, it would be great to have a feature that unallocates the requested resources... Remove the resource requests? Not clean. I looked for a long time for something that does this, but apart from writing a script I found no other solution. – Bguess May 06 '22 at 22:41
  • Indeed, CrashLoopBackOff should be fixed. But in my case, I let developers launch a K8s template themselves, with a few customisations, via the Python Kubernetes package, and they are not familiar with K8s. Anyway, anything that goes to production is checked, so CrashLoopBackOff shouldn't happen. If it does, my Python API has a thread that checks for CrashLoopBackOff or an image pull error and sends an email alert with the K8s logs. – BeGreen May 07 '22 at 23:06
0

Long story short, unfortunately there is no such option in Kubernetes.

However, you can do this manually by checking whether the pod is in CrashLoopBackOff and then freeing its resources, or simply deleting the pod itself.

The following script deletes any pod in the CrashLoopBackOff state in a specified namespace:

#!/bin/bash
# Check the given namespace and force-delete any pods in the 'CrashLoopBackOff' state.

NAMESPACE="test"

# Collect the names of all pods currently in CrashLoopBackOff.
delpods=$(sudo kubectl get pods -n "${NAMESPACE}" |
  grep -i 'CrashLoopBackOff' |
  awk '{print $1}')

for i in ${delpods}; do
  sudo kubectl delete pod "$i" --force --wait=false \
    --grace-period=0 -n "${NAMESPACE}"
done

Since we have passed the option --grace-period=0, the pod won't automatically restart again. But if, after using this script or running it as a job, you notice that the pod keeps restarting and falling back into the CrashLoopBackOff state, there is a workaround: change the restart policy of the pod:

A PodSpec has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always. restartPolicy applies to all Containers in the Pod. restartPolicy only refers to restarts of the Containers by the kubelet on the same node. Exited Containers that are restarted by the kubelet are restarted with an exponential back-off delay (10s, 20s, 40s …) capped at five minutes, and is reset after ten minutes of successful execution. As discussed in the Pods document, once bound to a node, a Pod will never be rebound to another node.

See more details in the documentation or here.
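
For example, here is a minimal sketch of creating such a Job with the Python Kubernetes client (which the asker mentioned using); the job name, image, and namespace are placeholders. It sets restartPolicy to Never so the kubelet does not restart the container in place, and backoffLimit to 0 so the Job does not create replacement pods:

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

# Placeholder Job: restart_policy="Never" stops in-place container restarts,
# backoff_limit=0 stops the Job from creating new pods after a failure.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="my-job"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(name="main", image="my-image:latest"),
                ],
            )
        ),
    ),
)

batch.create_namespaced_job(namespace="default", body=job)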

And that is it! Happy hacking.

Regarding your first question, it has already been answered by Bguess here.

Mostafa Wael
  • Please, consider giving me the bounty if this answers your question. – Mostafa Wael May 10 '22 at 08:37
  • To delete all failed pods in all namespaces you can use this command: `kubectl delete pods --field-selector status.phase=Failed -A`. But when using this command, you still need to change the `restartPolicy`, as the pod will otherwise restart again and again. – Mostafa Wael May 10 '22 at 08:57