Note that container_cpu_user_seconds_total and container_cpu_system_seconds_total are per-container counters that show the CPU time used by a particular container in user space and in kernel space, respectively (see these docs for more details). cAdvisor exposes an additional metric, container_cpu_usage_seconds_total. This metric equals the sum of container_cpu_user_seconds_total and container_cpu_system_seconds_total, i.e. it shows the overall CPU time used by each container. See these docs for more details.
container_cpu_usage_seconds_total is a counter, i.e. it increases over time. Its raw value is therefore not very informative for determining CPU usage at a particular time. Prometheus provides the rate() function, which returns the average per-second rate of increase of a counter. For example, the following query returns the average per-second increase of the per-container container_cpu_usage_seconds_total metrics over the last 5 minutes (see the 5m lookbehind window in square brackets):
rate(container_cpu_usage_seconds_total[5m])
This is effectively the average number of CPU cores used during the last 5 minutes. Multiply it by 100 to get CPU usage in %. Note that the resulting value may exceed 100% if the container used more than a single CPU core during the last 5 minutes.
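To make the arithmetic concrete, here is a minimal Python sketch of the idea behind rate(). It is a simplification, not how Prometheus actually implements it: the real rate() also handles counter resets and extrapolates to the window boundaries. The function name simple_rate and the sample values are made up for illustration.

```python
def simple_rate(samples):
    """Average per-second increase of a counter over the given samples.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Simplified on purpose: no counter-reset handling and no extrapolation,
    both of which the real Prometheus rate() performs.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical container_cpu_usage_seconds_total samples over a 5m window:
# the counter grows by 150 CPU-seconds in 300 wall-clock seconds.
samples = [(0, 100.0), (60, 130.0), (120, 160.0),
           (180, 190.0), (240, 220.0), (300, 250.0)]

cores = simple_rate(samples)   # average number of CPU cores used
cpu_percent = cores * 100      # the same value expressed in %

print(cores, cpu_percent)      # prints: 0.5 50.0
```

In this made-up example the container consumed 150 CPU-seconds during 300 seconds, i.e. half a core on average, or 50%.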
The rate(container_cpu_usage_seconds_total[5m]) query usually returns a TON of time series with many long labels in production Kubernetes, so it is better to use the following queries:
The average number of CPU cores used during the last 5 minutes per pod:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
The average number of CPU cores used during the last 5 minutes per node:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
The average number of CPU cores used during the last 5 minutes per namespace:
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
The container!="" filter removes superfluous metrics related to the cgroups hierarchy; see this answer for more details.
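As a rough illustration of what the sum(...) by (label) queries with the container!="" filter compute, here is a small Python sketch. The sum_by helper, the label values and the per-series rates are all hypothetical; the sketch also assumes that the series with an empty container label is the cgroup-hierarchy entry the filter is meant to drop.

```python
from collections import defaultdict


def sum_by(series, label):
    """Sum per-series values grouped by one label, skipping series whose
    container label is empty or missing (mirrors container!="")."""
    totals = defaultdict(float)
    for labels, value in series:
        if labels.get("container", "") == "":
            continue  # dropped by the container!="" filter
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical results of rate(container_cpu_usage_seconds_total[5m]):
# (labels, average CPU cores used)
series = [
    ({"pod": "web-1", "namespace": "prod", "container": "app"}, 0.5),
    ({"pod": "web-1", "namespace": "prod", "container": "sidecar"}, 0.25),
    ({"pod": "web-1", "namespace": "prod", "container": ""}, 0.75),
    ({"pod": "db-1", "namespace": "prod", "container": "postgres"}, 1.5),
]

per_pod = sum_by(series, "pod")
per_namespace = sum_by(series, "namespace")

print(per_pod)        # prints: {'web-1': 0.75, 'db-1': 1.5}
print(per_namespace)  # prints: {'prod': 2.25}
```

In this made-up data, leaving out the filter would count the web-1 usage twice (1.5 instead of 0.75), which is the kind of distortion the filter avoids.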