16

I have searched many websites and articles but haven't found a definitive answer. I am using EKS version 1.18. A few of my pods are "Evicted", and when I check the node I see the error "(combined from similar events): failed to garbage collect required amount of images. Wanted to free 6283487641 bytes, but freed 0 bytes".

Is there any way to find out why it's failing, or how to fix this issue? Any suggestions are most welcome.

[screenshot: node events showing the "failed to garbage collect required amount of images" error]

I can see the "overlay" filesystem on the disk fills up almost completely within a few hours. I am not sure what's going on. The screenshot below shows my memory utilization.

[screenshot: node memory/disk utilization]

JDGuide
  • 2
    As you don't really provide any context for your issue, it's very hard to advise anything. Looking at similar issues described [here](https://github.com/kubernetes/kubernetes/issues/71869), this might be related to node disk pressure or EBS storage. – acid_fuji Mar 31 '21 at 10:04
  • Thanks, Thomas. Actually, I have 5 nodes running on EKS. Each node contains around 10-12 pods, but when I check the nodes I see the error above. Also, after a few days there are many evicted pods. It seems like a memory issue, and the event is "FreeDiskSpaceFailed". If you would like me to share any specific config, please let me know. – JDGuide Mar 31 '21 at 10:38
  • Have you deleted the evicted pods? Did you check the kubelet log? There might be some information there on why the deletion failed. – anemyte Apr 06 '21 at 05:35
  • I have deleted the evicted pods. Which logs should I check? Any specific log or location? – JDGuide Apr 06 '21 at 05:51
  • I found the logs, but nothing specific related to this error. I'm trying to update the AMI minor version; let's see. – JDGuide Apr 06 '21 at 09:11
  • 1
    Do you have any Pods logging a lot? Your container logs (for example, docker logs) may take a lot of space if that is the case; I saw it happen once, so I would check just to be safe. If the root disk is under pressure, Pods are evicted to free space in an attempt to recover before the disk fills up completely. – AndD Apr 11 '21 at 09:43
  • @AndD Yeah, there are some jobs that created a problem with space, but I'm not sure whether those are the issue. Checking that part. Thanks. – JDGuide Apr 12 '21 at 07:34

4 Answers

1

See if you can change the Kubernetes garbage collection policies. I guess the issue may be due to recent changes in the kubelet flags:

the newer releases use the --eviction-* style settings in place of the old GC flags. Can you check whether that is the case with your setup and is causing the failure to clear space?

Please refer to the docs here:

https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/
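
If you decide to tune these, a minimal sketch of what the kubelet configuration could look like is below. The field names come from the KubeletConfiguration API (kubelet.config.k8s.io/v1beta1), but the threshold values here are only placeholders and need to be tuned for your nodes; on EKS, check how your AMI or node group passes kubelet configuration before changing anything.

# Sketch of a kubelet config; the values are placeholders, not recommendations.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Image GC starts when image disk usage exceeds the high threshold and
# tries to free space until usage drops below the low threshold.
imageGCHighThresholdPercent: 80
imageGCLowThresholdPercent: 70
# Eviction thresholds using the newer eviction-style settings.
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"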

NBaua
1

My local k3d cluster had the same issue. It turned out I was low on space and had a ton of dangling images (https://docs.docker.com/engine/reference/commandline/image_prune/); running docker image prune -a and recreating the cluster fixed it for me.
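
For reference, a rough sequence I would run on the host, assuming Docker is the runtime (the -a flag removes every image not referenced by a container, not just dangling ones):

# See how much space images, containers and volumes are using
docker system df
# Remove dangling images only
docker image prune
# Remove every image not used by at least one container (more aggressive)
docker image prune -a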

lordvcs
  • Yes, but what if there is containerd in place of docker? – mirekphd Jul 06 '23 at 11:09
  • 1
    @mirekphd https://stackoverflow.com/questions/64460740/prune-container-images-with-just-containerd-w-o-docker hope this helps – lordvcs Jul 08 '23 at 07:48
  • 1
    Thanks! I've tested `crictl rmi --prune` and it just removes unused images (those not used in running containers), so the chances for improvement there are slim compared to `docker prune`. – mirekphd Jul 08 '23 at 07:53
  • For me, I had tens of unused images accumulated over the years, and that made a huge difference. Clearing disk space by removing other unused files should help too. – lordvcs Jul 08 '23 at 08:15
1

So a workaround that can stabilize the situation for a while (giving you time to mount a larger volume for storing images) is to start using the local image cache, by setting in your Deployment (or Pod) manifest:

spec.containers[].imagePullPolicy: "IfNotPresent"
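
For example, a minimal Deployment sketch with that policy set (the name, image and replica count here are placeholders):

# Minimal sketch; my-app and the image reference are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:1.0.0
          # Reuse the image already present on the node instead of
          # pulling from the registry on every container start.
          imagePullPolicy: IfNotPresent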

One situation I've encountered where such quick storage exhaustion can happen is when you set imagePullPolicy to Always and the image then fails to pull completely (one reason being that there is not enough space). Kubernetes then enters an image pull loop (not throttled sufficiently by the backoff mechanism), and those incomplete image layers with different checksums, combined with the "always pull" policy, can quickly consume all available storage dedicated to container images (on the partition where the containerd/Docker data directory is located).

mirekphd
-1

Simply put: in my case, the disk was almost full on the reported node.

Check if node has disk pressure:

kubectl describe node node-x

Check pods on that node:

kubectl get pods -A -o wide | grep node-x

Access each pod and check df -m:

kubectl exec -it pod_name -- sh

Some tips:

  • depending on your K8s setup, you may want to focus on the root (/) filesystem on the pods and on node-x, since that is where space needs to be reduced

  • you can swap node-x for node-y and compare how the two nodes differ in disk usage by accessing them and their pods (in case node-y is healthy)

  • try to clean up space on node-x via SSH; maybe Docker is occupying the disk? Quick tips: docker image prune -a --filter "until=48h" to remove unused images, journalctl --vacuum-time=2d to clean up old journal logs, etc.

  • check kubectl logs for each pod on node-x to see whether some pod has written too many log lines (see the sketch after this list)
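
A rough node-side sketch tying these tips together, assuming SSH access to node-x and Docker as the container runtime (with containerd the image store lives under /var/lib/containerd instead of /var/lib/docker):

# Where is the space going?
df -h /
du -sh /var/lib/docker/* 2>/dev/null | sort -h
# Pod/container logs kept by the kubelet
du -sh /var/log/pods/* 2>/dev/null | sort -h | tail -n 10

# Reclaim space: drop images unused for 48h and trim the systemd journal
docker image prune -a --filter "until=48h"
journalctl --vacuum-time=2d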

laimison
  • Why is this answer downvoted? My issue was also the disk being almost full. It would help everyone if the downvoters also left a comment. – lordvcs Jul 08 '23 at 07:47
  • I can only guess that it didn't help someone, so they downvoted. But in reality, not all issues have only one solution. Also, an important fact not to miss: the disk does not have to be completely full; the issue kicks in earlier. – laimison Jul 10 '23 at 10:46