Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-13T02:40:46Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"e1d093448d0ed9b9b1a48f49833ff1ee64c05ba5", GitTreeState:"clean", BuildDate:"2021-06-03T00:20:57Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
I have a Kubernetes CronJob whose purpose is to run some Azure CLI commands on a time-based schedule.
Running the container locally works fine. However, manually triggering the CronJob through Lens, or letting it run on its schedule, results in unexpected behaviour when it runs in the cluster as a Job.
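For reference, my understanding is that the Lens trigger is roughly equivalent to creating a Job from the CronJob by hand, something like the sketch below (the job name here is just an example):
# Assumed equivalent of the manual trigger in Lens: create a Job from the CronJob's template
$ kubectl create job development-scale-down-manual-test \
    --from=cronjob/development-scale-down \
    -n development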
Here is the CronJob definition:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
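Since my kubectl client (1.21) and the cluster (1.20) are on different minor versions, a quick sanity check I can run is to read back the stored object and confirm it matches the manifest above:
# Read back the CronJob as the API server stored it
$ kubectl get cronjob development-scale-down -n development -o yaml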
I ran the CronJob manually and it created the job development-scale-down-manual-xwp1k. Describing this job after it completed shows the following:
$ kubectl describe job development-scale-down-manual-xwp1k
Name:                     development-scale-down-manual-xwp1k
Namespace:                development
Selector:                 controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
Labels:                   controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
                          job-name=development-scale-down-manual-xwp1k
Annotations:              <none>
Parallelism:              1
Completions:              1
Start Time:               Wed, 04 Aug 2021 09:40:28 +1200
Active Deadline Seconds:  360s
Pods Statuses:            0 Running / 0 Succeeded / 1 Failed
Pod Template:
  Labels:  controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
           job-name=development-scale-down-manual-xwp1k
  Containers:
   scaler:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Environment:
      CLUSTER_NAME:    ...
      NODEPOOL_NAME:   ...
      NODEPOOL_SIZE:   ...
      RESOURCE_GROUP:  ...
      SP_APP_ID:       <set to the key 'application_id' in secret 'scaler-secrets'>      Optional: false
      SP_PASSWORD:     <set to the key 'application_pass' in secret 'scaler-secrets'>    Optional: false
      SP_TENANT:       <set to the key 'application_tenant' in secret 'scaler-secrets'>  Optional: false
    Mounts:            <none>
  Volumes:             <none>
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      24m   job-controller  Created pod: development-scale-down-manual-xwp1k-b858c
  Normal   SuccessfulCreate      23m   job-controller  Created pod: development-scale-down-manual-xwp1k-xkkw9
  Warning  BackoffLimitExceeded  23m   job-controller  Job has reached the specified backoff limit
This differs from other issues I have read, in that there is no "SuccessfulDelete" event mentioned here.
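A quick way to confirm there really is no delete event for the Job itself is to filter events by the involved object (a sketch, using the job name from above):
# Show only events whose involved object is the Job, to confirm no SuccessfulDelete appears
$ kubectl get events -n development \
    --field-selector involvedObject.kind=Job,involvedObject.name=development-scale-down-manual-xwp1k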
The events received from kubectl get events tell an interesting story:
$ kubectl get events | grep xwp1k
3m19s Normal Scheduled pod/development-scale-down-manual-xwp1k-b858c Successfully assigned development/development-scale-down-manual-xwp1k-b858c to aks-burst-37275452-vmss00000d
3m18s Normal Pulling pod/development-scale-down-manual-xwp1k-b858c Pulling image "myimage:latest"
2m38s Normal Pulled pod/development-scale-down-manual-xwp1k-b858c Successfully pulled image "myimage:latest" in 40.365655229s
2m23s Normal Created pod/development-scale-down-manual-xwp1k-b858c Created container myimage
2m23s Normal Started pod/development-scale-down-manual-xwp1k-b858c Started container myimage
2m12s Normal Killing pod/development-scale-down-manual-xwp1k-b858c Stopping container myimage
2m12s Normal Scheduled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully assigned development/development-scale-down-manual-xwp1k-xkkw9 to aks-default-37275452-vmss000002
2m12s Normal Pulling pod/development-scale-down-manual-xwp1k-xkkw9 Pulling image "myimage:latest"
2m11s Normal Pulled pod/development-scale-down-manual-xwp1k-xkkw9 Successfully pulled image "myimage:latest" in 751.93652ms
2m10s Normal Created pod/development-scale-down-manual-xwp1k-xkkw9 Created container myimage
2m10s Normal Started pod/development-scale-down-manual-xwp1k-xkkw9 Started container myimage
3m19s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-b858c
2m12s Normal SuccessfulCreate job/development-scale-down-manual-xwp1k Created pod: development-scale-down-manual-xwp1k-xkkw9
2m1s Warning BackoffLimitExceeded job/development-scale-down-manual-xwp1k Job has reached the specified backoff limit
I can't figure out why the container was killed: the logs all look fine and there are no resource constraints. The container is removed very quickly, which leaves me very little time to debug. The more verbose event line reads as follows:
3m54s Normal Killing pod/development-scale-down-manual-xwp1k-b858c spec.containers{myimage} kubelet, aks-burst-37275452-vmss00000d Stopping container myimage 3m54s 1 development-scale-down-manual-xwp1k-b858c.1697e9d5e5b846ef
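Since the pod is gone almost immediately, the best I can think of for catching output is to stream logs by the job-name label as soon as a pod appears (a sketch; this assumes the standard job-name label that the job controller adds to pods):
# Follow logs from whichever pod the job creates, selected by the job-name label
$ kubectl logs -n development -l job-name=development-scale-down-manual-xwp1k -f --tail=-1
# Watch the pods come and go while re-triggering the job
$ kubectl get pods -n development -l job-name=development-scale-down-manual-xwp1k -w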
I note that the initial image pull takes a good few seconds (~40s). Could this contribute to exceeding the startingDeadlineSeconds or another limit in the CronJob spec?
Any thoughts or help are appreciated, thank you.