Prometheus - Convert cpu_user_seconds to CPU Usage %?

Question

I'm monitoring docker containers via Prometheus.io. My problem is that I'm just getting cpu_user_seconds_total or cpu_system_seconds_total.

How to convert this ever-increasing value to a CPU percentage?

Currently I'm querying:

rate(container_cpu_user_seconds_total[30s])

But I don't think this is quite correct (compared to top).

How to convert cpu_user_seconds_total to CPU percentage? (Like in top)

score 63 · Accepted Answer · answered Jan 21 '16 at 17:36

Rate returns a per second value, so multiplying by 100 will give a percentage:

rate(container_cpu_user_seconds_total[30s]) * 100
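To see why this works, here is a minimal sketch that does by hand roughly what rate() does for two samples of a counter. The helper name `cpu_percent` and the sample values are made up for illustration. The real rate() also handles counter resets and extrapolates to the window boundaries, which this sketch ignores.

```python
def cpu_percent(t0, v0, t1, v1):
    """CPU usage in percent of one core between two counter samples.

    t0, t1: sample timestamps in seconds
    v0, v1: counter values in CPU-seconds
            (e.g. container_cpu_user_seconds_total)
    """
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    # CPU-seconds consumed per wall-clock second (what rate() reports).
    per_second = (v1 - v0) / (t1 - t0)
    return per_second * 100


# Hypothetical samples 30 s apart: the counter grew by 7.5 CPU-seconds.
print(cpu_percent(0.0, 120.0, 30.0, 127.5))  # 25.0
```

A result above 100 simply means more than one core's worth of CPU time was used per second.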

answered Jan 21 '16 at 17:36

brian-brazil


1

From a blog post (also by @brian-brazil) I found a missing point: "these values always sum to one second per second for each cpu, [so] the per-second rates are also the ratios of usage." – Hamy Dec 30 '19 at 02:26
2

Here you are using a value of `[30s]`. In your blog post (https://www.robustperception.io/understanding-machine-cpu-usage), you mention a value of `[1m]`. Some users are using much larger values. What's the difference, and how do I find the correct value? And what impact does `100 - (avg by (instance)` have? – mhellmeier Jun 29 '21 at 08:14
14

For idiots like me. The rate function over a cpu seconds counter reads "How many seconds did the cpu work every second?". 1 second every second on 1 core will be 100%. 3 cores at 50% will be 1.5 seconds every second, etc... The bracket is the averaging window, longer periods will flatten the graph. Tune according to how erratic your cpu usage is. – Tamir Daniely Nov 02 '21 at 09:02

score 46 · Answer 2 · edited Feb 17 '19 at 09:30

I also found this way to get CPU Usage to be accurate:

100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)

From: http://www.robustperception.io/understanding-machine-cpu-usage/
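The arithmetic behind that query can be sketched as below. This is only an illustration: the function name `machine_cpu_percent` and the per-core idle rates are hypothetical, standing in for what irate(...{mode="idle"}) would return for each core of one instance.

```python
def machine_cpu_percent(idle_rates):
    """Overall CPU usage in percent for one machine.

    idle_rates: one value per core, the idle seconds per second
                (0.0 = fully busy, 1.0 = fully idle).
    """
    if not idle_rates:
        raise ValueError("need at least one core")
    # avg by (instance): average the idle fraction over all cores.
    avg_idle = sum(idle_rates) / len(idle_rates)
    # 100 - idle% = busy%.
    return 100 - avg_idle * 100


# Hypothetical 4-core machine: two cores 75% idle, two cores 50% idle.
print(machine_cpu_percent([0.75, 0.75, 0.5, 0.5]))  # 37.5
```

Because the cores are averaged rather than summed, the result stays between 0 and 100 for the whole machine regardless of the core count.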

edited Feb 17 '19 at 09:30

d4nyll


answered Aug 11 '16 at 11:09

Christian Vielma


1

I don't know why, but I tested this on a local desktop with prometheus and a website running and it reported 99% CPU usage. Which is highly unlikely given how the system reports 10% CPU usage. – Nephilim May 09 '19 at 08:09
2

This is for node exporter 0.16 and above. For 0.15 and below you can use `100 * (1 - avg by(instance)(irate(node_cpu{job='node',mode='idle'}[5m])))` – Sachin Arote Sep 23 '19 at 06:34
3

Does it sum all CPU cores? Why does Node Exporter Full Grafana dashboard use this instead? (((count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',instance=~"$node:$port",job=~"$job"}[5m])))) * 100) / count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu)) – pablo Dec 07 '19 at 21:15
I would prefer using `rate()` instead of `irate()`, since `irate()` can be quite noisy - https://valyala.medium.com/why-irate-from-prometheus-doesnt-capture-spikes-45f9896d7832 – valyala Oct 11 '21 at 07:50

score 11 · Answer 3 · answered Apr 01 '22 at 15:40

Note that container_cpu_user_seconds_total and container_cpu_system_seconds_total are per-container counters that show the CPU time used by a particular container in user space and in kernel space respectively (see these docs for more details). cAdvisor exposes an additional metric, container_cpu_usage_seconds_total. This metric equals the sum of container_cpu_user_seconds_total and container_cpu_system_seconds_total, i.e. it shows the overall CPU time used by each container. See these docs for more details.

container_cpu_usage_seconds_total is a counter, i.e. it only increases over time. The raw value isn't very informative for determining CPU usage at a particular time. Prometheus provides the rate() function, which returns the average per-second rate of increase of a counter. For example, the following query returns the average per-second increase of the per-container container_cpu_usage_seconds_total metrics over the last 5 minutes (see the 5m lookbehind window in square brackets):

rate(container_cpu_usage_seconds_total[5m])

This is basically the average number of CPU cores used during the last 5 minutes. Just multiply it by 100 in order to get CPU usage in %. Note that the resulting value may exceed 100% if the container uses more than a single CPU core during the last 5 minutes.

The rate(container_cpu_usage_seconds_total[5m]) usually returns a TON of time series with many long labels in production Kubernetes, so it is better to use the following queries:

The average number of CPU cores used per pod during the last 5 minutes:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)

The average number of CPU cores used per node during the last 5 minutes:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)

The average number of CPU cores used per namespace during the last 5 minutes:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)

The container!="" filter removes superfluous metrics related to cgroups hierarchy - see this answer for more details.
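As a rough illustration of what the sum(...) by (...) queries with the container!="" filter compute, here is a small sketch. The helper `sum_by`, the label values and the per-series rates are all hypothetical. The series with an empty container label is assumed here to be a pod-level cgroup aggregate, which would be counted twice without the filter.

```python
from collections import defaultdict


def sum_by(series, label):
    """Sum per-container rates grouped by one label.

    series: list of (labels_dict, cores_used) pairs, standing in for
            the output of rate(container_cpu_usage_seconds_total[5m]).
    Series with a missing or empty "container" label are skipped,
    mirroring the container!="" filter.
    """
    totals = defaultdict(float)
    for labels, cores in series:
        if labels.get("container", "") == "":
            continue
        totals[labels.get(label, "")] += cores
    return dict(totals)


# Hypothetical rate() results: (labels, average cores used).
series = [
    ({"pod": "web-1", "namespace": "prod", "container": "app"}, 0.5),
    ({"pod": "web-1", "namespace": "prod", "container": "sidecar"}, 0.25),
    ({"pod": "web-1", "namespace": "prod", "container": ""}, 0.75),  # assumed aggregate
    ({"pod": "db-1", "namespace": "prod", "container": "postgres"}, 1.5),
]

print(sum_by(series, "pod"))        # {'web-1': 0.75, 'db-1': 1.5}
print(sum_by(series, "namespace"))  # {'prod': 2.25}
```

Multiply any of these totals by 100 to get a percentage of one core.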

score 3 · Answer 4 · answered Mar 24 '20 at 20:39

For Windows Users - wmi_exporter

100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)

answered Mar 24 '20 at 20:39

Aidar Gatin
