18

I run a custom Kubernetes v1.9.2 setup and scrape various metrics with Prometheus v2.1.0. Among others, I scrape the kubelet and cAdvisor metrics.

I want to answer the question: "How much of the CPU resources defined by requests and limits in my deployment are actually used by a pod (and its containers) in terms of (milli)cores?"

There are a lot of scraped metrics available, but nothing like that. Maybe it could be calculated from the CPU usage time in seconds, but I don't know how.

I was starting to think it's not possible, until a friend told me she runs Heapster in her cluster, which has a graph in the built-in Grafana that shows exactly that: the individual CPU usage of a pod and its containers in (milli)cores.

Since Heapster also uses kubelet and cAdvisor metrics, I wonder: how can I calculate the same? The metric in InfluxDB is named cpu/usage_rate, but even with Heapster's code at hand, I couldn't figure out how it is calculated.

Any help is appreciated, thanks!

Alex

2 Answers

24

We're using the container_cpu_usage_seconds_total metric to calculate Pod CPU usage. This metric contains the total amount of CPU seconds consumed per container, per core (this is important, as a Pod may consist of multiple containers, each of which can be scheduled across multiple cores; however, the metric carries a pod_name label that we can use for aggregation). Of special interest is the rate of change of that metric, which can be calculated with PromQL's rate() function: if it increases by 1 within one second, the Pod is consuming 1 CPU core (or 1000 millicores) during that second.

The following PromQL query does just that: it computes the CPU usage of all Pods (aggregated with the sum(...) by (pod_name) operation), averaged over a five-minute window:

sum(rate(container_cpu_usage_seconds_total[5m])) by (pod_name)
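Since the question asks for (milli)cores, the same result can also be expressed in millicores by scaling the query by 1000. This is purely a unit conversion (1 core = 1000 millicores), not a different measurement:

sum(rate(container_cpu_usage_seconds_total[5m])) by (pod_name) * 1000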
helmbert
  • Thanks, there is a StackOverflow question and answer with exactly your query, but this is not what I am looking for. – Alex Feb 19 '18 at 19:37
  • @Alex _"this is not what I am looking for"_ Sorry to hear that. What _are_ you looking for? – helmbert Feb 19 '18 at 19:44
  • I'm sorry, I might be getting you wrong. So after firing the query, I get a value like 1.4757899821777767. You said this is equivalent to ~1470 millicores. Is there some resource for me to check that might help me understand that relation? – Alex Feb 19 '18 at 20:43
  • Basically, cores and millicores are nothing more than an abstraction of CPU utilization. A CPU consumption of "1 core" (~1000 millicores) means that a Pod utilizes one CPU core at 100% (500 millicores means 50%, and so on); CPU _limits_ work the same way: for example, a limit of 250 millicores means that a Pod may utilize one CPU core by no more than 25%. Have a look at [the _"Managing Compute Resources for Containers"_ section in the docs](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) for more information. – helmbert Feb 19 '18 at 22:27
  • I don't see how _container_cpu_usage_seconds_total_, which should give how many seconds the CPU was used, represents what the question asks for - CPU utilization _in terms of (milli)cores_. If _container_cpu_usage_seconds_total_ equals 100ms (milliseconds), how do I calculate how many millicores that is? – Ventzy Kunev Aug 20 '19 at 19:56
  • @VentzyKunev A total CPU usage of 100ms is -- by itself -- not particularly meaningful. It DOES become useful if you know that your Pod used 100ms of CPU time _per second_ -- which is exactly what Prometheus's `rate` function gives you. The rest is just conversion between different units. A usage of 100ms of CPU time per 1000ms of _real time_ corresponds to a CPU usage of 10% (within that one second) -- or, alternatively, 1/10 of a CPU core (or 100 millicores). – helmbert Aug 20 '19 at 21:24
  • @helmbert Shouldn't we take the average, instead of sum, if there are multiple cores to get the CPU utilization? Because as I see it, there is a `cpu` label in `container_cpu_usage_seconds_total` which specifies usage per each core?! – today Dec 01 '20 at 17:30
  • @today You can do both, depending on what question you want your query to answer. When taking the average, you'll get your Pod's _average utilization per core per second_ (a value of arguable usefulness, since knowing your number of cores is important for interpreting it -- an avg utilization of ".5 on one core" is wildly different from ".5 on 64 cores"). Using `sum`, you'll get your Pod's _total utilization across all cores per second_ (with a value like ".5" if you've utilized one core by 50% or two by 25%, or like "32" if you've utilized 64 cores by 50%). Hope that helps! – helmbert Dec 01 '20 at 17:55 (see the sketch after these comments)
  • @helmbert Thank you for your reply. You are right. Basically, the value given by `avg` is normalized by the number of cores, whereas `sum` is not. So I think if we want an overall estimate of CPU utilization (i.e. CPU usage percentage), then `avg` would be a better choice; however, if we are interested in finding the overall utilization of cores (i.e. how many cores are being utilized overall), then `sum` would be a better choice. – today Dec 01 '20 at 18:20
  • Many thanks, I've been looking for this particular query for a while. And the _explanation_ is really helpful. :-) – Rodney Gitzel Dec 09 '20 at 18:11
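To make the sum-versus-avg distinction from the comments concrete, here is a minimal PromQL sketch using only the metric from the answer above; the second query assumes the per-core cpu label mentioned by @today is present in your setup:

# total CPU cores used by each pod, summed over all of its containers and cores
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod_name)

# average rate per individual series (per container, per core) -- the "normalized by core count" view from the comments
avg(rate(container_cpu_usage_seconds_total[5m])) by (pod_name)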
2

The following PromQL query returns the per-pod number of used CPU cores on Kubernetes v1.16 and newer:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)

The {container!=""} filter is needed to drop the cgroup-hierarchy series (those with an empty container label), which already aggregate the per-container stats and would otherwise be counted twice. See this answer for more details on this.

The following PromQL query must be used for Kubernetes below v1.16, since older versions use different label names (container_name instead of container, and pod_name instead of pod - see this issue for details):

sum(rate(container_cpu_usage_seconds_total{container_name!=""}[5m])) by (pod_name)
valyala
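Beyond the raw usage numbers above, the original question asks how usage relates to the requests and limits defined in the Deployment. As a hedged sketch: if kube-state-metrics is installed (it is not mentioned in either answer), its kube_pod_container_resource_requests metric (with a resource="cpu" label in kube-state-metrics v2.x; older releases expose kube_pod_container_resource_requests_cpu_cores instead) can be divided into the usage rate:

# fraction of the requested CPU a pod is actually using (assumes kube-state-metrics)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod)
  / sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)

A result of 0.5 would mean a pod is currently using about half of the CPU it requested; the same pattern works for limits via kube_pod_container_resource_limits.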