
We had a major outage when both our container registry and the entire K8S cluster lost power. The cluster recovered faster than the container registry, and my pod (part of a StatefulSet) is now stuck in Error: ImagePullBackOff.

Is there a config setting to retry downloading the image from the CR periodically or recover without manual intervention?

I looked at imagePullPolicy, but that does not apply when the CR is unavailable.

Wytrzymały Wiktor
ucipass
  • It does retry periodically, but it retries less frequently the more failures it has had. – jordanm Feb 24 '22 at 16:14
  • Is there a way to control this or at least get a status about when it tries next? Otherwise, I assume this is hidden somewhere – ucipass Feb 24 '22 at 16:37
  • `kubectl get events -A` will show when a pull was attempted. Not sure about figuring out the next pull – jordanm Feb 24 '22 at 16:41
  • Maybe if you configure the deployment via some CI/CD tool, you can check at the pipeline level if the deployment fails, you can create another deployment that points to a mirror registry...or use a service like `Harbor` https://goharbor.io/ – Hackerman Feb 24 '22 at 17:18

1 Answer


The BackOff part of the ImagePullBackOff status means that Kubernetes keeps trying to pull the image from the registry, with an exponential back-off delay (10s, 20s, 40s, …). The delay between attempts increases until it reaches a compiled-in limit of 300 seconds (5 minutes); more on this in the Kubernetes docs.
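To make the schedule concrete, here is a minimal Python sketch of the delays described above. It is only an illustration of "double each time, capped at 300 seconds", not the kubelet's actual code; the function name and its parameters are made up for this example.

```python
def image_pull_backoff_schedule(attempts, initial=10, cap=300):
    """Return the delay (in seconds) before each retry.

    Illustrative only: the delay starts at `initial`, doubles after every
    failed attempt, and never exceeds `cap`. The name and parameters are
    hypothetical, not part of any Kubernetes API.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(delay)
        # Double the delay for the next attempt, but clamp it to the cap.
        delay = min(delay * 2, cap)
    return delays


if __name__ == "__main__":
    # Prints [10, 20, 40, 80, 160, 300, 300, 300]
    print(image_pull_backoff_schedule(8))
```

Under these assumptions the cap is reached on the sixth retry, so once the registry is back a pod should be retried within roughly five minutes.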

The backOffPeriod parameter for image pulls is a hard-coded constant in Kubernetes and unfortunately is not tunable at the moment, because changing it can affect node performance. The only alternative is to adjust it in the source code and build your own custom kubelet binary. There is still an open issue about making it adjustable.

anarxz
  • Thank you very much for the comprehensive answer! Really appreciate the links to the source code too! I think there's a potential bug related to the hard-coded 300s MaxContainerBackOff, as I had to restart my StatefulSet to trigger the image download after over 20 minutes. – ucipass Feb 26 '22 at 16:44