In my case the problem was that the nodes were filling up with Docker images: some of them unused and never pruned, others simply too big.
To confirm it, first SSH into the node and check whether the disk is (nearly) full.
For instance:
[root@node-name ~]# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 20G 15G 5.9G 71% /
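If the root filesystem is indeed almost full, docker system df gives a quick breakdown of how much of that space Docker is holding in images, containers, local volumes and build cache, and how much of it is reclaimable:
[root@node-name ~]# docker system df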
It's possible to find out which image specifically occupies the most space, and I recommend doing so.
Check this excellent resource to see how:
https://rharshad.com/eks-troubleshooting-disk-pressure/
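As a quick first pass directly on the node, you can also list the images sorted by size; the one-liner below is just one way to do it and assumes GNU sort:
[root@node-name ~]# docker images --format '{{.Size}}\t{{.Repository}}:{{.Tag}}' | sort -rh | head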
Knowing which image takes the most space, and digging into its filesystem to understand why, can help you optimize image sizes, but that's a different topic.
If you can't add more storage to the node, you can free up space with docker prune.
But first we need to make sure no containers are running on the node, so let's drain it:
kubectl drain node-name
Note that draining also cordons the node, which means no new pods will be scheduled on it.
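In practice a plain drain often refuses to evict DaemonSet-managed pods or pods with emptyDir data; if that happens, something like this is usually needed (check what the flags imply for your workloads before using them):
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data
(on older kubectl versions the second flag is called --delete-local-data)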
Back inside the node, let's prune the unused Docker resources:
[root@node-name ~]# docker system prune --all
WARNING! This will remove:
- all stopped containers
- all networks not used by at least one container
- all images without at least one container associated to them
- all build cache
Are you sure you want to continue? [y/N] y
Deleted Containers:
8333683571a2ceff47bf08cc254f8fa3809acacc7fb981be3c1c274e9465dd68
28bdc62425707127ac977d20fd3dc85374ffc54ccccf2b2f2098d9af9ca3c898
7315014bfd9207c5a1b8e76ef0f1567bb5e221de6fe0304f4728218abd7e1f3f
b0f5ecb854a9f4b41610d7ec5b556447600f57529e68ae2093d1d40df02ff214
9e24227321d5e151bc665c55bcd474c9d586857cbac3cad744aad2dc11729e5e
63ab1bf7ded78d4b77db22f9c1aaac6a55247c71ca55b51caa8492f2b16c4d69
...
Total reclaimed space: 4.529GB
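If you'd rather keep stopped containers, networks and the build cache and only remove unused images, docker image prune --all is a less aggressive alternative; note that neither command touches volumes unless you add --volumes to docker system prune:
[root@node-name ~]# docker image prune --all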
Then check the storage space again:
[root@node-name ~]# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 20G 8.9G 12G 45% /
Now let’s put the node back to a ready state using the kubectl command from the host:
rancher kubectl uncordon node-name
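To confirm the node is accepting workloads again, check that it no longer shows SchedulingDisabled (prefix the command with rancher if you're going through the Rancher CLI, as above):
kubectl get nodes node-name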