4

I'm working on a Kubernetes cron job which represents an integration test; it is a Go test binary that is compiled with go test -c and copied into a Docker container run by the cron job. The Kubernetes YAML starts similar to the following:

apiVersion: batch/v1beta1
kind: CronJob
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
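
The container section is omitted above; purely for illustration, a hypothetical sketch of it (the image name and binary path are made up) could look like this:

          containers:
          - name: integration-test
            # hypothetical image; the real one contains the binary built with `go test -c`
            image: registry.example.com/integration-test:latest
            # hypothetical path to the compiled test binary inside the image
            command: ["/integration.test", "-test.v"]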

At some point, the integration test started failing (exiting with code 1). I can see that the job has the same duration as its age:

$ kubectl get jobs -l app=integration-test
NAME                          COMPLETIONS   DURATION   AGE
integration-test-1592457300   0/1           7m20s      7m20s

The kubectl get pods command shows that pods are being created more frequently than the every 15 minutes I would expect from the cron schedule:

$ kubectl get pods -l app=integration-test
NAME                                READY   STATUS   RESTARTS   AGE
integration-test-1592457300-224x8   0/1     Error    0          92s
integration-test-1592457300-5f8sz   0/1     Error    0          7m33s
integration-test-1592457300-9zvjq   0/1     Error    0          3m57s
integration-test-1592457300-th7sf   0/1     Error    0          6m26s
integration-test-1592457300-vhbr2   0/1     Error    0          5m17s

This behavior of spinning up new pods is problematic because it contributes to the running pod count on the node - essentially, it consumes resources.

How can I make it so that the cron job doesn't keep spinning up new pods, but only starts one every 15 minutes, and doesn't continue to consume resources if the job fails?

Update

A simplified example of this uses a Kubernetes YAML adapted from https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/:

$ cat cronjob.yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: Never

Note that it exits with code 1. If I run this using kubectl apply -f cronjob.yaml and then check the pods, I see

$ kubectl get pods
NAME                                                    READY   STATUS      RESTARTS   AGE
hello-1592459760-fnvcw                                  0/1     Error       0          30s
hello-1592459760-w75lt                                  0/1     Error       0          31s
hello-1592459760-xzhwn                                  0/1     Error       0          20s

The ages of the pods are less than a minute apart; in other words, pods are spun up before the cron interval has elapsed. How can I prevent this?

Kurt Peek

2 Answers

15

It's quite a specific scenario, and it's hard to guess what you want to achieve and whether it will work for you.

concurrencyPolicy: Forbid prevents another job from being created if the previous one has not completed. But I think that is not the case here.

restartPolicy applies to the pod (however, in a Job template you can only use OnFailure and Never). If you set restartPolicy to Never, the job will automatically keep creating new pods until completion.

A Job creates one or more Pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions.

If you set restartPolicy: Never, the job will keep creating pods until it reaches backoffLimit; however, those pods will remain visible in your cluster with Error status, since each pod exits with status 1. You would need to remove them manually. If you set restartPolicy: OnFailure, it will keep restarting one pod and will not create more.

But there is another way. What is considered a completed job?

Examples:

1. restartPolicy: OnFailure
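
For this example, assume the same CronJob manifest as in the question's update, with only restartPolicy changed to OnFailure (a sketch):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: OnFailure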

$ kubectl get po,jobs,cronjob
NAME                         READY   STATUS             RESTARTS   AGE
pod/hello-1592495280-w27mt   0/1     CrashLoopBackOff   5          5m21s
pod/hello-1592495340-tzc64   0/1     CrashLoopBackOff   5          4m21s
pod/hello-1592495400-w8cm6   0/1     CrashLoopBackOff   5          3m21s
pod/hello-1592495460-jjlx5   0/1     CrashLoopBackOff   4          2m21s
pod/hello-1592495520-c59tm   0/1     CrashLoopBackOff   3          80s
pod/hello-1592495580-rrdzw   0/1     Error              2          20s
NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592495220   0/1           6m22s      6m22s
job.batch/hello-1592495280   0/1           5m22s      5m22s
job.batch/hello-1592495340   0/1           4m22s      4m22s
job.batch/hello-1592495400   0/1           3m22s      3m22s
job.batch/hello-1592495460   0/1           2m22s      2m22s
job.batch/hello-1592495520   0/1           81s        81s
job.batch/hello-1592495580   0/1           21s        21s
NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/hello   */1 * * * *   False     6        25s             15m

Each job will create only 1 pod, which will be restarted until the job finishes or is considered completed by the CronJob.

If you describe the CronJob, you can find this in the Events section:

Events:
  Type    Reason            Age                  From                Message
  ----    ------            ----                 ----                -------
  Normal  SuccessfulCreate  18m                  cronjob-controller  Created job hello-1592494740
  Normal  SuccessfulCreate  17m                  cronjob-controller  Created job hello-1592494800
  Normal  SuccessfulCreate  16m                  cronjob-controller  Created job hello-1592494860
  Normal  SuccessfulCreate  15m                  cronjob-controller  Created job hello-1592494920
  Normal  SuccessfulCreate  14m                  cronjob-controller  Created job hello-1592494980
  Normal  SuccessfulCreate  13m                  cronjob-controller  Created job hello-1592495040
  Normal  SawCompletedJob   12m                  cronjob-controller  Saw completed job: hello-1592494740
  Normal  SuccessfulCreate  12m                  cronjob-controller  Created job hello-1592495100
  Normal  SawCompletedJob   11m                  cronjob-controller  Saw completed job: hello-1592494800
  Normal  SuccessfulDelete  11m                  cronjob-controller  Deleted job hello-1592494740
  Normal  SuccessfulCreate  11m                  cronjob-controller  Created job hello-1592495160
  Normal  SawCompletedJob   10m                  cronjob-controller  Saw completed job: hello-1592494860

Why was job hello-1592494740 considered Completed? The default value of a Job's .spec.backoffLimit is 6 (this information can be found in the docs). If the job fails 6 times (the pod fails and is restarted 6 times), the CronJob will consider this job Completed and will remove it. As the job is removed, its pod is removed as well.

However, in your example, a pod was created, executed the date and echo commands, and then exited with code 1. Even though the pod is crashing, it did write that output. As the last command was exit 1, it will keep crashing until it reaches the limit. As per the example below:

$ kubectl get pods
NAME                     READY   STATUS             RESTARTS   AGE
hello-1592495400-w8cm6   0/1     Terminating        6          5m51s
hello-1592495460-jjlx5   0/1     CrashLoopBackOff   5          4m51s
hello-1592495520-c59tm   0/1     CrashLoopBackOff   5          3m50s
hello-1592495580-rrdzw   0/1     CrashLoopBackOff   4          2m50s
hello-1592495640-nbq59   0/1     CrashLoopBackOff   4          110s
hello-1592495700-p6pcx   0/1     Error              3          50s
user@cloudshell:~ (project)$ kubectl logs hello-1592495520-c59tm
Thu Jun 18 15:55:13 UTC 2020
Hello from the Kubernetes cluster

2. restartPolicy: Never and backoffLimit: 0

The YAML below was used:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; exit 1
          restartPolicy: Never
      backoffLimit: 0

Output

$ kubectl get po,jobs,cronjob
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497320-svd6k   0/1     Error    0          44s
NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497320   0/1           44s        44s
NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/hello   */1 * * * *   False     0        51s             11m

$ kubectl describe cronjob
...
Events:
  Type    Reason            Age                  From                Message
  ----    ------            ----                 ----                -------
  Normal  SuccessfulCreate  12m                  cronjob-controller  Created job hello-1592496720
  Normal  SawCompletedJob   11m                  cronjob-controller  Saw completed job: hello-1592496720
  Normal  SuccessfulCreate  11m                  cronjob-controller  Created job hello-1592496780
  Normal  SawCompletedJob   10m                  cronjob-controller  Saw completed job: hello-1592496780
  Normal  SuccessfulDelete  10m                  cronjob-controller  Deleted job hello-1592496720
  Normal  SuccessfulCreate  10m                  cronjob-controller  Created job hello-1592496840
  Normal  SuccessfulDelete  9m55s                cronjob-controller  Deleted job hello-1592496780
  Normal  SawCompletedJob   9m55s                cronjob-controller  Saw completed job: hello-1592496840
  Normal  SuccessfulCreate  9m5s                 cronjob-controller  Created job hello-1592496900
  Normal  SawCompletedJob   8m55s                cronjob-controller  Saw completed job: hello-1592496900
  Normal  SuccessfulDelete  8m55s                cronjob-controller  Deleted job hello-1592496840
  Normal  SuccessfulCreate  8m5s                 cronjob-controller  Created job hello-1592496960
  Normal  SawCompletedJob   7m55s                cronjob-controller  Saw completed job: hello-1592496960
  Normal  SuccessfulDelete  7m55s                cronjob-controller  Deleted job hello-1592496900
  Normal  SuccessfulCreate  7m4s                 cronjob-controller  Created job hello-1592497020

This way, only one job and one pod will be running at the same time (there might be a roughly 10-second window in which there are 2 jobs and 2 pods).

$ kubectl get po,job
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497440-twzlf   0/1     Error    0          70s
pod/hello-1592497500-2q7fq   0/1     Error    0          10s

NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497440   0/1           70s        70s
job.batch/hello-1592497500   0/1           10s        10s
user@cloudshell:~ (project)$ kk get po,job
NAME                         READY   STATUS   RESTARTS   AGE
pod/hello-1592497500-2q7fq   0/1     Error    0          11s

NAME                         COMPLETIONS   DURATION   AGE
job.batch/hello-1592497500   0/1           11s        11s
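
Applied back to the integration test from the question, the relevant fields would look roughly like this (a sketch; it assumes the rest of the pod spec stays as it already is):

apiVersion: batch/v1beta1
kind: CronJob
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      backoffLimit: 0   # give up after the first failed pod instead of retrying 6 times
      template:
        spec:
          restartPolicy: Never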

I hope this clears it up a bit. If you would like a more precise answer, please give more information about your scenario.

PjoterS
0

The default concurrencyPolicy is Allow.

You can set concurrencyPolicy: Forbid to avoid running new jobs in parallel.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  # Allow | Forbid | Replace
  concurrencyPolicy: Forbid
  jobTemplate:

RammusXu