44

I'm monitoring docker containers via Prometheus.io. My problem is that I'm just getting cpu_user_seconds_total or cpu_system_seconds_total.

How do I convert this ever-increasing value to a CPU percentage?

Currently I'm querying:

rate(container_cpu_user_seconds_total[30s])

But I don't think that is quite correct (compared to top).

How do I convert cpu_user_seconds_total to a CPU percentage (like in top)?

Willi Mentzel
M156

4 Answers

63

rate() returns a per-second value, so multiplying by 100 gives a percentage:

rate(container_cpu_user_seconds_total[30s]) * 100
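
To make the arithmetic behind this query concrete, here is a minimal Python sketch of what a rate over a counter computes. It is a simplification and not Prometheus's actual implementation: the real rate() also extrapolates to the edges of the window. The function name `simple_rate` and the sample values are hypothetical.

```python
# Simplified illustration of rate(counter[window]) * 100.
# Assumption: no extrapolation to the window edges, unlike real Prometheus.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset (e.g. a container restart);
        # in that case count the new value as the increase since the reset.
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed  # CPU seconds used per wall-clock second


# Hypothetical scrapes of container_cpu_user_seconds_total, 10 s apart.
samples = [(0, 100.0), (10, 102.5), (20, 105.0), (30, 107.5)]
cpu_percent = simple_rate(samples) * 100
print(cpu_percent)  # 7.5 CPU seconds over 30 s -> 25.0
```

A result of 25.0 here means the container used a quarter of one core on average over the window; a value above 100 would mean more than one core was in use.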

brian-brazil
  • 1
    From a blog post (also by @brian-brazil) I found a missing point: "these values always sum to one second per second for each cpu, [so] the per-second rates are also the ratios of usage." – Hamy Dec 30 '19 at 02:26
  • 2
    Here you are using a value of `[30s]`. In your blog post (https://www.robustperception.io/understanding-machine-cpu-usage), you mention a value of `[1m]`. Some users are using quite larger values. What's the difference and how to find the correct value? And what impact does `100 - (avg by (instance)` have? – mhellmeier Jun 29 '21 at 08:14
  • 14
    For idiots like me. The rate function over a cpu seconds counter reads "How many seconds did the cpu work every second?". 1 second every second on 1 core will be 100%. 3 cores at 50% will be 1.5 seconds every second, etc... The bracket is the averaging window, longer periods will flatten the graph. Tune according to how erratic your cpu usage is. – Tamir Daniely Nov 02 '21 at 09:02
46

I also found this way to get CPU Usage to be accurate:

100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)

From: http://www.robustperception.io/understanding-machine-cpu-usage/
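
As a rough illustration of what this query does with the per-core idle rates, the following Python sketch averages them and subtracts from 100. The function name and the four-core sample values are hypothetical; in Prometheus the averaging is done by `avg by (instance)`.

```python
# Sketch of 100 - (avg by (instance) (idle rate)) * 100 for one instance.

def idle_based_usage_percent(idle_rates_per_cpu):
    """idle_rates_per_cpu: idle seconds per second for each core, each in [0, 1]."""
    avg_idle = sum(idle_rates_per_cpu) / len(idle_rates_per_cpu)
    return 100 - avg_idle * 100


# Hypothetical 4-core machine: per-core rate of the mode="idle" counter.
usage = idle_based_usage_percent([0.75, 0.5, 1.0, 0.25])
print(usage)  # average idle is 0.625 -> 37.5
```

Because the idle rates are averaged rather than summed, the result stays between 0 and 100 for the whole machine regardless of the number of cores.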

d4nyll
Christian Vielma
  • 1
    I don't know why, but I tested this on a local desktop with prometheus and a website running and it reported 99% CPU usage. Which is highly unlikely given how the system reports 10% CPU usage. – Nephilim May 09 '19 at 08:09
  • 2
    This is for node exporter version 0.16 and above. For version 0.15 and below you can use `100 * (1 - avg by(instance)(irate(node_cpu{job='node',mode='idle'}[5m])))` – Sachin Arote Sep 23 '19 at 06:34
  • 3
    Does it sum all CPU cores? Why does Node Exporter Full Grafana dashboard use this instead? (((count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',instance=~"$node:$port",job=~"$job"}[5m])))) * 100) / count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu)) – pablo Dec 07 '19 at 21:15
  • I would prefer using `rate()` instead of `irate()`, since `irate()` can be quite noisy - https://valyala.medium.com/why-irate-from-prometheus-doesnt-capture-spikes-45f9896d7832 – valyala Oct 11 '21 at 07:50
11

Note that container_cpu_user_seconds_total and container_cpu_system_seconds_total are per-container counters that show the CPU time used by a particular container in user space and in kernel space respectively (see these docs for more details). cAdvisor exposes an additional metric, container_cpu_usage_seconds_total. This metric equals the sum of container_cpu_user_seconds_total and container_cpu_system_seconds_total, i.e. it shows the overall CPU time used by each container. See these docs for more details.

container_cpu_usage_seconds_total is a counter, i.e. it only increases over time. Its raw value isn't very informative for determining CPU usage at a particular time. Prometheus provides the rate() function, which returns the average per-second rate of increase of a counter. For example, the following query returns the average per-second increase of the per-container container_cpu_usage_seconds_total metrics over the last 5 minutes (see the 5m lookbehind window in square brackets):

rate(container_cpu_usage_seconds_total[5m])

This is basically the average number of CPU cores used during the last 5 minutes. Multiply it by 100 to get CPU usage in %. Note that the resulting value may exceed 100% if the container used more than a single CPU core during the last 5 minutes.

The rate(container_cpu_usage_seconds_total[5m]) usually returns a TON of time series with many long labels in production Kubernetes, so it is better to use the following queries:

The average number of CPU cores used during the last 5 minutes per pod:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)

The average number of CPU cores used during the last 5 minutes per node:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)

The average number of CPU cores used during the last 5 minutes per namespace:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)

The container!="" filter removes superfluous metrics related to cgroups hierarchy - see this answer for more details.
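
The filter-then-aggregate pattern in these queries can be sketched in Python as follows. This is only an illustration of the grouping logic, not how Prometheus evaluates the query; the helper `sum_by`, the label values, and the rates are hypothetical, and the series with an empty container label is assumed to be a parent-cgroup entry that duplicates the usage of the containers below it.

```python
from collections import defaultdict


def sum_by(series, label):
    """Sum values of (labels, value) pairs, grouped by one label."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


# Hypothetical per-series results of rate(container_cpu_usage_seconds_total[5m]),
# expressed in CPU cores.
series = [
    ({"pod": "web-1", "container": "app", "namespace": "prod"}, 0.5),
    ({"pod": "web-1", "container": "sidecar", "namespace": "prod"}, 0.25),
    ({"pod": "db-1", "container": "postgres", "namespace": "prod"}, 1.5),
    # Assumed parent-cgroup series: repeats the usage of app + sidecar.
    ({"pod": "web-1", "container": "", "namespace": "prod"}, 0.75),
]

# Equivalent of the {container!=""} filter.
filtered = [(labels, value) for labels, value in series if labels["container"] != ""]

per_pod = sum_by(filtered, "pod")              # sum(...) by (pod)
per_namespace = sum_by(filtered, "namespace")  # sum(...) by (namespace)
print(per_pod)
print(per_namespace)
```

Without the filter, the web-1 pod in this made-up data would be reported as using 1.5 cores instead of 0.75, because the parent-cgroup series would be counted on top of its containers.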

valyala
3

For Windows Users - wmi_exporter

100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)

Aidar Gatin