5

This is a very weird thing.

I created a private GKE cluster with a node pool of 3 nodes. Then I created a replica set with 3 pods. Some of these pods get scheduled onto one particular node.

One of these pods always gets ImagePullBackOff. When I check the error, I see:

Failed to pull image "bitnami/mongodb:3.6": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

And the pods scheduled to the remaining two nodes work well.

I SSHed into that node and ran docker pull, and everything was fine. I cannot find another way to troubleshoot this error.

I tried draining or deleting that node and letting the cluster recreate it, but it still does not work.

Help me, please.

Update: According to the GCP documentation, pulling images from Docker Hub will fail.

BUT the weirdest thing is that ONLY ONE node is unable to pull the images.

Wytrzymały Wiktor
Chao
  • Has anyone gotten anywhere on this? I don't understand how GKE is tractable if it cannot use public Docker Hub images. – Dmitry M Aug 06 '20 at 18:43
  • My Answer explains what's going on, but I forgot to address "BUT the weirdest thing is ONLY ONE node is unable to pull the images.", so I'll address in comment. If some nodes were able to pull and then other nodes (ONE node) suddenly wasn't able to pull, it means the image was in the cache mirror.gcr.io (when the other nodes pulled it) and then was removed from the cache. (You can't depend on cached images/no guarantees of them staying in cache.) – neoakris May 22 '23 at 04:13

2 Answers

1

A related bug was reported in Kubernetes 1.11.

Make sure that is not what you are running into.

Wytrzymały Wiktor
Meir Tseitlin
1

I recall running into this before and finding an answer.

https://cloud.google.com/container-registry/docs/pulling-cached-images
touches on it a little, but I'll explain it so it's easy to follow.

If I spin up a private GKE cluster and create 3 deployments:

  • 1st uses image: nginx:latest
  • 2nd uses image: nginx:stable
  • 3rd uses image: docker.io/busybox:1.36.0-glibc

nginx:latest (common tag) will almost always work
nginx:stable (popular tag) will work sometimes
docker.io/busybox:1.36.0-glibc (super specific, rarely used tag) will almost always fail with ImagePullBackOff

So why is this the case?
1. The ImagePullBackOff happens when the pods/nodes have no NAT gateway and therefore no internet access.
kubectl exec -it working-nginx-latest-pod -- curl yahoo.com
^-- You can prove there is no internet access with this. Note that curl google.com is a bad test on GKE: Google's network can reach google.com without going through the internet, so you will still get a response. That's why I recommend testing with a non-Google URL like yahoo.com.
(Google's networking also occasionally does counterintuitive, non-standard things, like routing public IP addresses over its internal network, so you can sometimes reach public IP addresses without internet access. It's usually Google services with public IPs that are reachable this way.)
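
If curl is not available in the container, the same check can be approximated with a plain TCP connection attempt. This is a minimal sketch, assuming Python is available wherever you run it; the function name can_reach and its connect parameter are hypothetical helpers for illustration, not part of any GKE tooling.

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port succeeds within timeout.

    `connect` is injectable so the check can be exercised without a network;
    by default it is the standard library's socket.create_connection.
    """
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers timeouts, DNS failures, and refused/unreachable connections.
        return False
    conn.close()
    return True
```

Calling can_reach("registry-1.docker.io") probes the Docker Hub endpoint named in the error message from the question, and can_reach("yahoo.com") mirrors the curl test. A False result is consistent with missing internet access, but it does not by itself tell you why the connection failed.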

2. So the next question is: how are nginx:latest and nginx:stable able to pull an image that lives on Docker Hub when there's no internet access? In other words, why does it work for some images and not others?
The answer boils down to the popularity of the image:tag pair. Is it popular enough to get cached in mirror.gcr.io?

The link I shared at the top mentions that "Container Registry caches frequently-accessed public Docker Hub images on mirror.gcr.io". So if you reference a common tag of a popular image, you can sometimes get lucky and pull it even without internet access, because the cache is reachable via private IP space.

When a pod on a private GKE cluster gives you ImagePullBackOff, you may wonder what's going on: you know the image exists, and docker pull docker.io/busybox:1.36.0-glibc works fine from your local machine. What's happening is that the rarely used tag doesn't exist in the cache, which only mirrors common tags of popular images.
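
To see which cache path a given image reference would correspond to, here is a rough sketch of the name mapping. It assumes the mirror keeps Docker Hub's repository layout (official images under library/, as in mirror.gcr.io/library/nginx); the helper names are hypothetical, and the function only builds a name. It cannot tell you whether that tag is actually cached at the moment.

```python
DOCKER_HUB_HOSTS = {"docker.io", "index.docker.io", "registry-1.docker.io"}


def split_registry(image):
    """Split an image reference into (registry host, remainder).

    Follows the usual Docker convention: the first path component is a
    registry host only if it contains '.' or ':' or equals 'localhost'.
    """
    first, sep, rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return "docker.io", image


def mirror_candidate(image):
    """Return the mirror.gcr.io name for a Docker Hub image, else None."""
    registry, remainder = split_registry(image)
    if registry not in DOCKER_HUB_HOSTS:
        return None  # not a Docker Hub image, so the mirror does not apply
    if "@" in remainder:
        repo, ref = remainder.split("@", 1)
        sep = "@"
    elif ":" in remainder.rsplit("/", 1)[-1]:
        repo, ref = remainder.rsplit(":", 1)
        sep = ":"
    else:
        repo, ref, sep = remainder, "latest", ":"  # Docker's default tag
    if "/" not in repo:
        repo = "library/" + repo  # official images live under library/
    return f"mirror.gcr.io/{repo}{sep}{ref}"


for name in ("nginx:latest", "bitnami/mongodb:3.6", "docker.io/busybox:1.36.0-glibc"):
    print(name, "->", mirror_candidate(name))
```

If you can reach the mirror from a machine with Docker installed, trying docker pull on the printed name is one way to check whether a particular tag is currently cached.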

The best way to fix it is either to pull all images from pkg.dev (GCP's Artifact Registry, which GKE should be able to access without internet access) or to set up a NAT gateway so that the private cluster has internet access. You can use kubectl exec -it working-nginx-latest-pod -- curl yahoo.com as a feedback loop to check whether the cluster has internet access as you tinker with VPC settings to add the NAT gateway.

https://cloud.google.com/kubernetes-engine/docs/best-practices/networking#use-cloudnat
mentions that, by default, (GKE) "private clusters don't have internet access. In order to allow Pods to reach the internet, enable Cloud NAT for each region. At a minimum, enable Cloud NAT for the primary and secondary ranges in the GKE subnet."
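
As a rough illustration of the Cloud NAT route, the sketch below assembles the two gcloud commands typically involved: one to create a Cloud Router and one to attach a NAT configuration to it. The network "default", the region "us-central1", and the names nat-router and nat-config are placeholder assumptions, so substitute your own values. The script only prints the commands; it does not run them.

```python
import shlex


def cloud_nat_commands(network, region, router="nat-router", nat="nat-config"):
    """Build gcloud argv lists for a Cloud Router plus a Cloud NAT config.

    The router and NAT names are placeholders chosen for this sketch.
    """
    create_router = [
        "gcloud", "compute", "routers", "create", router,
        f"--network={network}",
        f"--region={region}",
    ]
    create_nat = [
        "gcloud", "compute", "routers", "nats", "create", nat,
        f"--router={router}",
        f"--region={region}",
        "--auto-allocate-nat-external-ips",
        # NAT every subnet in the region, primary and secondary ranges alike.
        "--nat-all-subnet-ip-ranges",
    ]
    return [create_router, create_nat]


# Print the commands for review instead of executing them.
for argv in cloud_nat_commands("default", "us-central1"):
    print(shlex.join(argv))
```

After applying something like this, rerunning the curl yahoo.com check from a pod should show whether outbound access now works.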

neoakris
  • 1
    What a nerve rack this was! I was going crazy wondering why some images would pull and others wouldn't. Creating a CloudNAT + Router solved the problem for me right away. I was using the default vpc and just assumed things would just work there no problem, but default vpc does not come default with a CloudNAT attached. – Nick Jun 14 '23 at 20:49