
Context

Researchers are running HPC calculations on our Kubernetes cluster. Unfortunately, some pods cannot be scheduled because the container engine (Docker, in our case) is unable to pull their images: the node is running out of disk space.

Hypotheses

Images too big

The first hypothesis is that the images are too big. This is probably a contributing factor, since we know that some images exceed 7 GB.

Datasets being decompressed locally

Our second hypothesis is that some people are downloading their datasets onto the node (e.g. curl ...) and decompressing them there. This would produce the behavior we are observing.

Envisioned solution

I believe this problem is a good case for a DaemonSet with access to the node's file system. Typically, this pod would calculate the total disk space used by all the pods on the node and expose the figures as Prometheus metrics. From there it would be easy to set up alert rules that flag pods whose usage has grown a lot over a short period of time. A sketch of such a DaemonSet is shown below.
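For illustration, here is a minimal sketch of what such a DaemonSet could look like. The image name and port are hypothetical placeholders for whatever exporter would walk the pod directories and publish per-pod usage; only the hostPath mount and the DaemonSet shape are the point here:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pod-disk-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: pod-disk-exporter
  template:
    metadata:
      labels:
        app: pod-disk-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      containers:
      - name: exporter
        image: example.org/pod-disk-exporter:0.1   # hypothetical exporter image
        ports:
        - containerPort: 9100
          name: metrics
        volumeMounts:
        # Read-only view of the kubelet's pod directories
        # (emptyDir volumes and other per-pod ephemeral data live here;
        # writable container layers sit under the container engine's
        # data root, e.g. /var/lib/docker, and would need a second mount)
        - name: kubelet-dir
          mountPath: /host/var/lib/kubelet
          readOnly: true
      volumes:
      - name: kubelet-dir
        hostPath:
          path: /var/lib/kubelet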

How to calculate the total disk space used by a pod?

The question then becomes: is there a way to calculate the total disk space used by a pod?

Does anyone have any experience with this?
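For what it's worth, the kubelet's stats summary endpoint already reports per-pod ephemeral-storage usage, so something along these lines should surface the numbers without any custom collection (replace <node-name> with an actual node; the jq filter is just a sketch):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.pods[] | {pod: .podRef.name, usedBytes: ."ephemeral-storage".usedBytes}'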

# Show metrics for a given pod and sort by 'cpu' or 'memory'
kubectl top pod --sort-by=memory
# Going through the folders yourself
kubectl get pods -n default -o json | jq '.items[] | .metadata.name' | xargs -I {} sh -c "du -sh /var/i_dont_know_which_folder_is_default | awk '{print $1}'"
# List PersistentVolumes sorted by capacity
kubectl get pv --sort-by=.spec.capacity.storage
– Bhaskar13 Dec 07 '22 at 17:37

1 Answer


Kubernetes does not track overall storage availability. It only knows about emptyDir volumes and the filesystem backing them.

To see a node's total disk capacity, you can use the command below:

kubectl describe nodes

In the output of the above command you can grep for ephemeral-storage, which is the node's virtual disk size; this partition is also shared and consumed by pods via emptyDir volumes, image layers, container logs, and container writable layers.
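For example, to pull out just those lines:

kubectl describe nodes | grep ephemeral-storage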

Also check whether a process is still running and holding file descriptors open on deleted files, which prevents the space from being reclaimed (other processes besides the one you expect may be holding descriptors too). The kubelet itself is worth checking.

You can verify this by running:

$ ps -Af | grep xxxx
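If lsof is available on the node, it can also list deleted files that are still held open (files whose link count has dropped to zero), which is the classic signature of space that df reports as used but du cannot find:

lsof +L1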

With Prometheus (and node_exporter) you can calculate the total disk size with the expression below:

sum(node_filesystem_size_bytes)
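If you want used space rather than raw capacity, a variation along these lines should work, assuming the standard node_exporter filesystem metrics (the fstype filter is illustrative and excludes pseudo-filesystems):

sum by (instance) (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"})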

Please go through Get total and free disk space using Prometheus for more information.
