Prometheus rate functions and interval selections

Question

I am doing some monitoring with Prometheus and am trying to understand how to properly use the rate functions.

The premise is this: I have a counter, and Prometheus is configured to ingest new values every 15s.

Now I am trying to graph the per-second rate of this counter, so using the rate function I do this:

rate(pgbouncer_sent_bytes_total{job="pgbouncer", database="worker"}[1m])

As I interpret the rate function, the query will give me a rolling rate average (in 1m look-back windows) at each point in time that is queried. The interval between points is determined by the resolution used.

Below is a screenshot from the Prometheus console including the raw data graph and the plot from the rate query above using a 1m resolution. The resulting rate graph does not really match my expectations when I look at the raw data in the bottom graph.

[Image: data graphs]

The interesting bit is also that the resulting graph will look very different depending on the point in time at which it is loaded. Simply reloading the same graph a few times in a row will completely shift its look, to the point where it does not even look like it represents the same data. The image below is the same dataset a few minutes later, but the same occurs even seconds later.

[Image: rate reloaded graph]

Could someone shed some light on what is really going on here?

I also noticed that the rate calculation is jumpy and varies with reloads. The rate calculation is more complex than just looking at the slope between the first and last measurement in the interval; maybe another method should be provided as well. https://github.com/prometheus/prometheus/blob/master/promql/functions.go#L50 — eckes, Feb 01 '18 at 11:05

score 30 · Answer 1 · answered Apr 04 '18 at 15:10

AFAICT the cause for the weird results is (1) the fact that your counter actually only increases once every minute, even though you collect it every 15 seconds, combined with (2) Prometheus' rate() implementation discarding every 4th counter increase (in your particular setup).

More precisely, you appear to be computing a 1 minute rate, every 1 minute over a counter scraped at 15 second resolution, increasing every 1 minute (on average).

What this means essentially is that Prometheus will basically slice your 1 hour interval into disjoint 1 minute ranges and estimate the rate over each range. The first value will be the extrapolated rate of increase between points 0 and 3, the second will be the extrapolated rate between points 4 and 7 and so on. Because your counter only actually increases once a minute, you can run into 2 different situations:

1. Your counter increases happen between point pairs 3-4, 7-8 etc. In this case Prometheus sees an increase rate of zero (because there is no increase between points 0 and 3, points 4 and 7 etc.). This seems to be happening in the first half of your first graph.
2. Your counter increases happen somewhere between points 0-3, 4-7 etc. In this case Prometheus takes the difference between the last and first points in each interval (your actual counter increase), divides it by the time difference between the 2 points (on average 45 seconds), then extrapolates that to 1 minute (essentially overestimating it by a factor of 1.(3), i.e. 60/45 -- I'm eyeballing an increase of ~200k over ~50 minutes, so an average rate of about 67 QPS, whereas rate() returns something closer to 90 QPS). This is what happens in the second half of your graph.
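
The two situations can be reproduced with a small simulation. This is only a sketch of the arithmetic described above, not Prometheus' actual code: it assumes scrapes land exactly every 15 seconds, always extrapolates to the full window, and ignores counter resets. The jump size of 4000 per minute is a hypothetical value derived from the eyeballed ~200k over ~50 minutes, and the function names are made up for illustration.

```python
SCRAPE_INTERVAL = 15  # seconds, from the question
WINDOW = 60           # seconds, the [1m] range selector
JUMP_EVERY = 60       # assumption: the counter jumps once a minute
JUMP_SIZE = 4000      # hypothetical: ~200k over ~50 minutes


def counter_value(t, jump_offset):
    """Hypothetical counter that jumps by JUMP_SIZE at jump_offset, jump_offset + 60, ..."""
    if t < jump_offset:
        return 0
    return ((t - jump_offset) // JUMP_EVERY + 1) * JUMP_SIZE


def simplified_rate(eval_time, jump_offset):
    """Simplified rate(): extrapolate (last - first) to the window, then divide by it."""
    # Scrape timestamps that fall inside [eval_time - WINDOW, eval_time].
    times = [t for t in range(0, eval_time + 1, SCRAPE_INTERVAL)
             if t >= eval_time - WINDOW]
    delta = counter_value(times[-1], jump_offset) - counter_value(times[0], jump_offset)
    sampled_span = times[-1] - times[0]          # 45 s for 4 points
    extrapolated_increase = delta * WINDOW / sampled_span
    return extrapolated_increase / WINDOW


# Evaluating at t=112 selects the scrapes at 60, 75, 90 and 105.
between_windows = simplified_rate(112, jump_offset=50)  # jump falls outside the 4 points
inside_window = simplified_rate(112, jump_offset=20)    # jump falls between the 4 points

print(round(between_windows, 1))  # 0.0
print(round(inside_window, 1))    # 88.9, versus a true rate of 4000 / 60 = 66.7
```

The only difference between the two calls is the phase of the counter jumps relative to the evaluation time, which is what changes when the graph is reloaded.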

This is also why your graph looks wildly different across refreshes. The argument for the current implementation of rate() is that it is "correct on average". Which, if you look at the whole of your graph, across refreshes, is true. </sarcasm>

Essentially graphing a Prometheus rate() or increase() over a time range R with resolution R will result in aliasing, either overestimating (1.33x in your case) or underestimating (zero in your case) on anything but a smoothly increasing counter.
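
To see both the aliasing and the "correct on average" property, the same kind of simplified model can be evaluated every 15 seconds instead of every minute. As before, this is a sketch under assumptions (jitter-free scrapes, full-window extrapolation, no counter resets, hypothetical jump size and offset), not a reproduction of Prometheus internals.

```python
SCRAPE_INTERVAL = 15  # seconds, from the question
WINDOW = 60           # seconds, the [1m] range selector
JUMP_EVERY = 60       # assumption: the counter jumps once a minute
JUMP_SIZE = 4000      # hypothetical jump size
JUMP_OFFSET = 20      # hypothetical: jumps happen at t = 20, 80, 140, ...


def counter_value(t):
    """Hypothetical counter: number of jumps up to time t, times JUMP_SIZE."""
    if t < JUMP_OFFSET:
        return 0
    return ((t - JUMP_OFFSET) // JUMP_EVERY + 1) * JUMP_SIZE


def simplified_rate(eval_time):
    """(last - first) / sampled span, which is what full-window extrapolation reduces to."""
    times = [t for t in range(0, eval_time + 1, SCRAPE_INTERVAL)
             if t >= eval_time - WINDOW]
    delta = counter_value(times[-1]) - counter_value(times[0])
    return delta / (times[-1] - times[0])


# 16 evaluations, 15 s apart; starting at 112 keeps them off the scrape timestamps.
eval_times = list(range(112, 112 + 16 * SCRAPE_INTERVAL, SCRAPE_INTERVAL))
rates = [simplified_rate(t) for t in eval_times]

true_rate = JUMP_SIZE / JUMP_EVERY
mean_rate = sum(rates) / len(rates)

# A 1m-resolution graph only sees every 4th evaluation; which ones depends on load time.
graph_loaded_now = rates[0::4]
graph_loaded_30s_later = rates[2::4]

print([round(r, 1) for r in rates])
print(round(mean_rate, 1), round(true_rate, 1))
print(graph_loaded_now, graph_loaded_30s_later)
```

In this model three out of every four evaluations overestimate by 4/3 and the fourth returns zero, so the mean matches the true rate while any single 1m-resolution graph shows only one of the two extremes.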

You can work around it by replacing your expression with:

rate(foo[75s]) / 75  * 60

This way you'll actually get the rate of increase between data points 1 minute apart (a 75 seconds range will almost always return exactly 5 points, so 4 counter increases) and reverse the extrapolation to 75 seconds that Prometheus does. There will be some noise in edge cases (e.g. if your evaluation is aligned with scraping times it's possible to get 6 points in one range and 4 in the next due to scrape interval jitter) but you're getting that anyway with rate().
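
The point counts mentioned above can be checked with a few lines. This sketch assumes jitter-free scrapes on exact multiples of 15 seconds and treats both window boundaries as inclusive; both are simplifying assumptions, and the helper name is hypothetical.

```python
SCRAPE_INTERVAL = 15  # seconds, from the question


def samples_in_window(eval_time, window):
    """Scrape timestamps inside [eval_time - window, eval_time] (inclusive, no jitter)."""
    start = eval_time - window
    return [t for t in range(0, eval_time + 1, SCRAPE_INTERVAL) if t >= start]


one_minute = samples_in_window(eval_time=307, window=60)
unaligned_75 = samples_in_window(eval_time=307, window=75)
aligned_75 = samples_in_window(eval_time=300, window=75)

# Number of points and the time span they cover, in seconds.
print(len(one_minute), one_minute[-1] - one_minute[0])      # 4 45
print(len(unaligned_75), unaligned_75[-1] - unaligned_75[0])  # 5 60
print(len(aligned_75), aligned_75[-1] - aligned_75[0])        # 6 75
```

The last line shows the edge case where the evaluation time coincides with a scrape, so the window picks up an extra point.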

BTW, you can see the aliasing by increasing the resolution of your graph to something like 1 second (anything 15 seconds or below should show it clearly).

This is a great explanation of the underlying dynamics of the rate() function — redlus, May 14 '18 at 14:44
This is a very good explanation. I have also found Rate results to be highly unintuitive. There’s some lengthy discussion about why rate behaves the way it does and (sadly) why Prometheus doesn’t consider fixing it/providing a more intuitive alternative here: https://github.com/prometheus/prometheus/issues/3746 — Johannes Rudolph, Apr 09 '19 at 04:42

score 2 · Answer 2 · answered Aug 12 '16 at 10:09

What you say doesn't line up with the data: the raw data is only going up about once a minute. Are you sure you're scraping every 15s?

brian-brazil

Yes; it becomes more apparent when zooming in, however. Also, this would imply that simply scaling up the range selection / resolution would solve it. It does not. I included an example series; the same problem exists in all the series I have, really. – Pelleplutt Aug 12 '16 at 11:05
The problem is your scraping. A 1 minute scrape interval combined with a 1 minute range is going to be highly susceptible to races. – brian-brazil Aug 12 '16 at 14:52

score 0 · Answer 3 · answered Apr 01 '22 at 15:10

The rate() function in Prometheus can miss some increases for slow-changing time series as Alin explained in this answer. See also this issue. Prometheus developers are going to fix this in the near future according to Alin's design doc.

There is a workaround, though: use the rate() function from MetricsQL. It is free from the issues mentioned above, so it should return the expected results for both fast-changing and slow-changing counters. See technical details here and here.
