It looks like you have a slow-changing integer counter, which may increase by less than 100 during an hour. Prometheus can return unexpected results from the `increase()` function when it is applied to slow-changing integer counters because of the following issues:

- `increase(m[d])` may return fractional results over an integer counter `m` because of extrapolation. See this issue, and the sketch after this list.
- `increase(m[d])` may miss the counter increase between the last raw sample just before the lookbehind window `d` and the first raw sample inside the lookbehind window `d`. See this article for more details.
- `increase(m[d])` may miss the initial counter increase if the `m` time series starts from a value other than zero.
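To see the extrapolation effect on a slow-changing counter, it can help to compare the extrapolated estimate with the raw delta over the same window. This is only a diagnostic sketch (the two expressions are separate queries to run side by side): `m` is a placeholder metric name, and the raw delta ignores counter resets inside the window, so it is not a drop-in replacement for `increase()`.

```promql
# Extrapolated estimate: may return fractional values (e.g. 2.7)
# even though the underlying counter only grows in integer steps.
increase(m[1h])

# Raw delta between the latest sample and the sample 1h earlier.
# Stays integer, but does not account for counter resets, so use it
# only as a diagnostic, not as a replacement for increase().
m - (m offset 1h)
```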
The same issues apply to `rate()` as well, since `increase()` is syntactic sugar over `rate()` in Prometheus: `increase(m[d]) = rate(m[d]) * d`.
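For example, with a `1h` window (`d` = 3600 seconds), the following two queries return the same values, so any inaccuracy in `rate()` shows up in `increase()` as well:

```promql
# increase() over a 1h window ...
increase(http_requests_count{service_name="someservice"}[1h])

# ... equals rate() over the same window multiplied by the window
# length in seconds (1h = 3600s).
rate(http_requests_count{service_name="someservice"}[1h]) * 3600
```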
It is recommended to use longer lookbehind windows for the `rate()` and `increase()` functions when they are applied to slow-changing counters, in order to minimize the significance of the issues mentioned above. For example, use a `1h` lookbehind window in square brackets instead of `5m`, so the increased window catches non-zero counter increases.
As for the original query, it is better to rewrite it as follows:
```promql
(
  sum(increase(errorMetric{service_name="someservice"}[1h]))
  /
  sum(increase(http_requests_count{service_name="someservice"}[1h]))
) > 0.05
```
This query contains the following changes compared to the original query:

- The `5m` lookbehind window has been changed to `1h`.
- The `path` label filter has been removed from the `http_requests_count` metric selector, so the set of `errorMetric` time series matches the set of `http_requests_count` time series. Alternatively, the `path` label filter could be added to the `errorMetric` selector instead, as shown in the sketch below.
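A sketch of that alternative, which keeps the narrower scope of the original query. It assumes `errorMetric` exposes the same `path` label; `path="/some/path"` is a placeholder for whatever path value the original query filters on:

```promql
(
  # path="/some/path" is a placeholder for the original query's path value;
  # errorMetric is assumed to carry the same path label.
  sum(increase(errorMetric{service_name="someservice", path="/some/path"}[1h]))
  /
  sum(increase(http_requests_count{service_name="someservice", path="/some/path"}[1h]))
) > 0.05
```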