In my project, GKE runs many jobs daily. Occasionally a job runs twice: the first run is partial and the second completes fully, even though `restartPolicy: Never` is defined. It happens very seldom (about once per 200 - 300 runs).

This is an example:

```
I 2020-12-03T00:12:45Z Started container mot-test-deleteoldvalidations-container
I 2020-12-03T00:12:45Z Created container mot-test-deleteoldvalidations-container
I 2020-12-03T00:12:45Z Successfully pulled image "gcr.io/xxxxx/mot-del-old-validations:v16"
I 2020-12-03T00:12:40Z Pulling image "gcr.io/xxxxx/mot-del-old-validations:v16"
I 2020-12-03T00:12:39Z Stopping container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Started container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Created container mot-test-deleteoldvalidations-container
I 2020-12-03T00:01:59Z Successfully pulled image "gcr.io/xxxx/mot-del-old-validations:v16"
I 2020-12-03T00:01:40Z Pulling image "gcr.io/xxxxx/mot-del-old-validations:v16"
```

From the job's YAML:

```yaml
spec:
  backoffLimit: 0
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
        - resources:
            limits:
              cpu: "1"
              memory: 2500Mi
            requests:
              cpu: 500m
              memory: 2Gi
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
```

The reason given for stopping the container is "Killing". How can I avoid this behavior?

BT3
  • Could you share your job/cronjob manifest? What GKE version are you using? – PjoterS Dec 07 '20 at 09:27
  • I use 1.17.13-gke, but the same happened with 1.16. The YAML file is too big. Which parameters do you mean? – BT3 Dec 07 '20 at 10:12
  • I've asked about your cronjob manifest as I'd like to replicate it on my cluster. What parameters do you have in `failedJobsHistoryLimit` and `successfulJobsHistoryLimit`? Is it a `Parallel` or `Non-Parallel` job? How often does this situation occur? Every 10, 100, 1000 jobs? Is it possible to provide `kubectl logs`? – PjoterS Dec 07 '20 at 14:32
  • This is not a cronjob but a regular one. – BT3 Dec 08 '20 at 10:13
  • I added information from the manifest to my post – BT3 Dec 08 '20 at 10:14
  • Regarding the job (stdout) log: it looks like the program works for a certain time and is suddenly restarted from the beginning. – BT3 Dec 08 '20 at 10:20
  • According to [Handling Pod and container failures](https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures) `Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.` Could you share more details, what is your job doing? In spec you have only those 3 parameters, others are default? Do you know how long a single job takes or it didnt reach resource limits? – PjoterS Dec 09 '20 at 16:52
  • Thanks, PjoterS. Have you any general recommendations? – BT3 Dec 10 '20 at 09:01

1 Answer

As you mention in the comment section, your `restartPolicy` is set to `Never`. You have also set `spec.backoffLimit`, `spec.completions` and `spec.parallelism`, which should work. However, the documentation - [Handling Pod and container failures](https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures) - mentions that this behavior is possible and is not considered a problem.

> Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.

In addition, the CronJob documentation says the best practice is to make jobs idempotent.

> A cron job creates a job object about once per execution time of its schedule. We say "about" because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.

> In computing, an idempotent operation is one that has no additional effect if it is called more than once with the same input parameters. For example, removing an item from a set can be considered an idempotent operation on the set.
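To make this concrete, here is a minimal sketch (the table and function names are hypothetical, not taken from the asker's job) of why a cleanup job like "delete old validations" can be written idempotently: deleting rows older than a cutoff has no additional effect when a duplicate run repeats it.

```python
# Hypothetical sketch: a delete-older-than-cutoff cleanup is naturally
# idempotent - a duplicate run finds nothing left to delete.
import sqlite3

def delete_old_validations(conn, cutoff):
    """Delete validations older than cutoff; safe to run twice."""
    cur = conn.execute("DELETE FROM validations WHERE created < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # rows actually removed by this run

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE validations (created INTEGER)")
conn.executemany("INSERT INTO validations VALUES (?)", [(1,), (5,), (9,)])

first = delete_old_validations(conn, 6)   # removes the rows 1 and 5
second = delete_old_validations(conn, 6)  # duplicate run: removes nothing
print(first, second)  # -> 2 0
```

If every run ends in the same final state, a rare duplicate start is harmless even when it cannot be prevented.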

As your whole job manifest is still a mystery, two workarounds come to my mind. Depending on the scenario, one of them might help.

First workaround

Use `podAntiAffinity`, which prevents a second pod with the same label/selector from being scheduled alongside the first.
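A sketch of what that could look like in the pod template (the `app: mot-del-old-validations` label is an assumption, not from the asker's manifest). With `topologyKey: kubernetes.io/hostname`, a duplicate pod carrying the same label cannot be scheduled onto the same node as the first:

```yaml
# Sketch only - fragment of a Job's pod template, assumed label name.
spec:
  template:
    metadata:
      labels:
        app: mot-del-old-validations
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: mot-del-old-validations
              topologyKey: kubernetes.io/hostname
```

Note that this constrains where a duplicate pod may run relative to the first; it does not by itself stop the duplicate from existing.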

Second workaround

Use an initContainer lock: the first pod takes a lock, and if a second pod detects the lock, it waits for 3-5 seconds and exits.

> Because init containers run to completion before any app containers start, init containers offer a mechanism to block or delay app container startup until a set of preconditions are met.
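A rough sketch of that lock (the PVC name `job-lock` and the mount path are assumptions, and the file check is not atomic, so this is best-effort rather than a hard guarantee). A duplicate pod whose init container finds the lock file fails, and with `restartPolicy: Never` its app container never starts:

```yaml
# Sketch only - assumed PVC name and paths; best-effort locking.
spec:
  template:
    spec:
      volumes:
        - name: lock
          persistentVolumeClaim:
            claimName: job-lock
      initContainers:
        - name: take-lock
          image: busybox
          command:
            - sh
            - -c
            - |
              if [ -f /lock/running ]; then
                sleep 5   # grace period in case the first pod is finishing
                exit 1    # fail the pod; app container never starts
              fi
              touch /lock/running
          volumeMounts:
            - name: lock
              mountPath: /lock
```

Something (the job itself on success, or a cleanup step) has to remove `/lock/running` afterwards, otherwise the next scheduled run would also refuse to start.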

marc_s
PjoterS