I've run into this before and tracked down the answer.
https://cloud.google.com/container-registry/docs/pulling-cached-images
covers it briefly, but I'll walk through it so it's easy to follow.
If I spin up a private GKE cluster and I create 3 deployments:
- 1st uses image: nginx:latest
- 2nd uses image: nginx:stable
- 3rd uses image: docker.io/busybox:1.36.0-glibc
nginx:latest (common tag) will almost always work
nginx:stable (popular tag) will work sometimes
The super specific tag (rarely used tag) will almost always fail with ImagePullBackOff
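If you want to reproduce the three cases yourself, something like this sketch should do it. The deployment names (nginx-latest, nginx-stable, busybox-glibc) and the two function names are made up for illustration; only the image references come from the example above.

```shell
# Sketch: create the three test deployments on the private cluster.
# Deployment names are hypothetical; the image references are the ones above.
create_test_deployments() {
  kubectl create deployment nginx-latest --image=nginx:latest
  kubectl create deployment nginx-stable --image=nginx:stable
  # busybox exits right away without a command, so keep it alive with sleep.
  kubectl create deployment busybox-glibc \
    --image=docker.io/busybox:1.36.0-glibc -- sleep 3600
}

# Show pod status for the three deployments. "kubectl create deployment"
# labels pods with app=<deployment name>, which the selector relies on.
show_test_pods() {
  kubectl get pods -l 'app in (nginx-latest, nginx-stable, busybox-glibc)'
}
```

Run create_test_deployments, give it a minute, then run show_test_pods. On a cluster without internet access, the busybox pod is the one I'd expect to end up in ErrImagePull / ImagePullBackOff.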
So why is this the case?
1. The ImagePullBackOff happens when the pods/nodes have no NAT gateway, i.e. no internet access. You can prove there's no internet with:
kubectl exec -it working-nginx-latest-pod -- curl yahoo.com
Note that curl google.com is a bad test on GKE: Google's network can reach google.com without going through the internet, so you'll get a response even when there is no internet access. That's why I recommend testing with a non-Google URL like yahoo.com.
(Google's networking also occasionally does counterintuitive, non-standard things, like routing public IP addresses over its internal network, so you can sometimes reach public IPs without internet access. It's usually Google services with public IPs that behave this way.)
2. So the next question is: how are nginx:latest and nginx:stable able to pull an image that lives on the internet (on Docker Hub) when there's no internet access? Basically, why does it work for some images and not others?
The answer boils down to the popularity of the image:tag pair: is it popular enough to get cached in mirror.gcr.io?
The link I shared at the top says "Container Registry caches frequently-accessed public Docker Hub images on mirror.gcr.io". So if you reference a common tag of a popular image, you can sometimes get lucky enough to pull it even without internet, because the cache is reachable via private IP space.
When a pod running on a GKE private cluster gives you ImagePullBackOff, you're left wondering what's going on: I know this image exists! docker pull docker.io/busybox:1.36.0-glibc pulls fine from my local machine. What's happening is that the rarely used tag doesn't exist in their cache, which only mirrors common tags of popular images.
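One way to see whether a given tag is in the cache is to try pulling it from the mirror directly. Here's a minimal sketch for building the mirror reference. It assumes the input is a Docker Hub reference and that official images sit under the library/ namespace on mirror.gcr.io; to_mirror_ref is a helper name I made up, not part of any tool.

```shell
# Sketch: turn a Docker Hub image reference into its mirror.gcr.io equivalent.
# Assumes the input refers to Docker Hub (references to other registries
# are not handled).
to_mirror_ref() {
  ref="${1#docker.io/}"            # drop an explicit Docker Hub host
  case "$ref" in
    */*) ;;                        # already namespaced, e.g. bitnami/redis
    *) ref="library/$ref" ;;       # official images live under library/
  esac
  case "${ref##*/}" in
    *:*|*@*) ;;                    # tag or digest already present
    *) ref="$ref:latest" ;;        # Docker's default tag
  esac
  printf 'mirror.gcr.io/%s\n' "$ref"
}

# to_mirror_ref nginx:latest
#   -> mirror.gcr.io/library/nginx:latest
# to_mirror_ref docker.io/busybox:1.36.0-glibc
#   -> mirror.gcr.io/library/busybox:1.36.0-glibc
```

Running docker pull "$(to_mirror_ref docker.io/busybox:1.36.0-glibc)" should then tell you whether the mirror can serve that particular tag.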
The best fix is to either pull all images from pkg.dev (GCP's Artifact Registry, which GKE should be able to access without internet access) or set up a NAT gateway so the private cluster has internet access. You can use kubectl exec -it working-nginx-latest-pod -- curl yahoo.com as a feedback loop to check whether the cluster has internet access as you tinker with VPC settings to add the NAT gateway.
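For the Artifact Registry route, a rough sketch of copying an image over from a machine that does have internet access might look like this. The function name and all argument values are placeholders, and it assumes a Docker-format Artifact Registry repository already exists and that your gcloud account can push to it.

```shell
# Sketch: copy a Docker Hub image into Artifact Registry.
# Run this from a machine with internet access; all values are placeholders.
copy_to_artifact_registry() {
  src="$1"        # e.g. docker.io/busybox:1.36.0-glibc
  region="$2"     # e.g. us-central1
  project="$3"    # your GCP project ID
  repo="$4"       # an existing Docker-format Artifact Registry repository
  name_tag="${src##*/}"   # keep only "busybox:1.36.0-glibc"
  dst="${region}-docker.pkg.dev/${project}/${repo}/${name_tag}"

  # Let docker authenticate to the regional pkg.dev host, then copy the image.
  gcloud auth configure-docker "${region}-docker.pkg.dev" --quiet &&
  docker pull "$src" &&
  docker tag "$src" "$dst" &&
  docker push "$dst" &&
  echo "$dst"     # print the reference to use in the deployment
}
```

The printed reference is what you'd put in the deployment's image: field instead of the docker.io one. The nodes' service account also needs read access to the repository for the pull to succeed.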
https://cloud.google.com/kubernetes-engine/docs/best-practices/networking#use-cloudnat
mentions that, by default, (GKE) "private clusters don't have internet access. In order to allow Pods to reach the internet, enable Cloud NAT for each region. At a minimum, enable Cloud NAT for the primary and secondary ranges in the GKE subnet."
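As a sketch of the Cloud NAT route, the following creates a Cloud Router plus a NAT config and then re-runs the curl check. The router/NAT names (nat-router, nat-config) and the function names are placeholders I picked; pass in your cluster's VPC network and region. It also assumes the pod's image ships curl, as the nginx one in the example does.

```shell
# Sketch: add Cloud NAT for the cluster's VPC in one region.
# nat-router / nat-config are placeholder names.
create_cloud_nat() {
  network="$1"; region="$2"
  gcloud compute routers create nat-router \
    --network="$network" --region="$region" &&
  # --nat-all-subnet-ip-ranges covers primary and secondary subnet ranges.
  gcloud compute routers nats create nat-config \
    --router=nat-router --region="$region" \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
}

# Returns 0 if the pod can reach a non-Google site, non-zero otherwise.
pod_has_internet() {
  kubectl exec "$1" -- curl -sS --max-time 5 -o /dev/null https://yahoo.com
}
```

For example: create_cloud_nat my-vpc us-central1, then pod_has_internet working-nginx-latest-pod && echo "internet OK". NAT can take a short while to take effect, so it may be worth re-running the check a few times.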