AFAICT the cause for the weird results is (1) the fact that your counter actually only increases once every minute, even though you collect it every 15 seconds combined with (2) Prometheus' rate()
implementation discarding every 4th counter increase (in your particular setup).
More precisely, you appear to be computing a 1 minute rate, every 1 minute over a counter scraped at 15 second resolution, increasing every 1 minute (on average).
What this means essentially is that Prometheus will basically slice your 1 hour interval into disjoint 1 minute ranges and estimate the rate over each range. The first value will be the extrapolated rate of increase between points 0 and 3, the second will be the extrapolated rate between points 4 and 7 and so on. Because your counter only actually increases once a minute, you can run into 2 different situations:
- Your counter increases happen between point pairs 3-4, 7-8 etc. In this case Prometheus sees an increase rate of zero (because there is no increase between points 0 and 3, points 4 and 7 etc. This seems to be happening in the first half of your first graph.
- Your counter increases happen somewhere between points 0-3, 4-7 etc. In this case Prometheus takes the difference between the last and first points in each interval (your actual counter increase), divides it by the time difference between the 2 points (on average 45 seconds), then extrapolates that to 1 minute (essentially overestimating it by a factor of 1.(3) -- I'm eyeballing an increase of ~200k over ~50 minutes, so an average rate of about 67 QPS, whereas
rate()
returns something closer to 90 QPS). This is what happens in the second half of your graph.
This is also why your graph looks wildly different across refreshes. The argument for the current implementation of rate()
is that it is "correct on average". Which, if you look at the whole of your graph, across refreshes, is true. </sarcasm>
Essentially graphing a Prometheus rate()
or increase()
over a time range R with resolution R will result in aliasing, either overestimating (1.33x in your case) or underestimating (zero in your case) on anything but a smoothly increasing counter.
You can work around it by replacing your expression with:
rate(foo[75s]) / 75 * 60
This way you'll actually get the rate of increase between data points 1 minute apart (a 75 seconds range will almost always return exactly 5 points, so 4 counter increases) and reverse the extrapolation to 75 seconds that Prometheus does. There will be some noise in edge cases (e.g. if your evaluation is aligned with scraping times it's possible to get 6 points in one range and 4 in the next due to scrape interval jitter) but you're getting that anyway with rate()
.
BTW, you can see the aliasing by increasing the resolution of your graph to something like 1 second (anything 15 seconds or below should show it clearly).