
I am very new to Prometheus and have the following alert, whose goal is to fire when the number of errors in the total number of requests is higher than 5%:

sum(increase(errorMetric{service_name="someservice"}[5m])) /  sum(increase(http_requests_count{service_name="someservice", path="/some/path"}[5m])) > 0.05

I have a rough idea of the traffic: it is around 100 requests per hour over a 24h interval. How valuable is it to have the interval set to 5m? Should this range cover a longer period of time, e.g. 1h? This alert goes off without really indicating a problem. What is your view?

Thank you

panza
  • Additional question: is there a strong reason why I should use ```rate``` as opposed to ```increase```? – panza Nov 14 '22 at 12:51

2 Answers


Buried in the mass of Prometheus docs, there is a paragraph on the increase function:

increase should only be used with counters and native histograms where the components behave like counters. It is syntactic sugar for rate(v) multiplied by the number of seconds under the specified time range window, and should be used primarily for human readability.

So, to answer your questions:

  1. Is there a strong reason why I should use rate as opposed to increase?

    Yes, use the rate function. increase is just rate multiplied by the window length and is intended primarily for human readability, so rate is the better fit for alerting rules.

  2. How valuable is it to have the interval set to 5m?

    Not so valuable. Since your RPS/QPS is very small (fewer than 10 requests per 5m window), some 5m ranges may contain few or zero requests while others contain many more. The alert rule will be too sensitive, or simply wrong when viewed over a wider time range. A 30m or 1h range might be better.

By the way, the label filters on each side of the division operator should match (the same path filter on both selectors, or on neither), so the errors and the requests refer to the same set of series.
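
Putting those points together, a minimal sketch of the same alert using rate over a 1h window, with the path filter dropped from http_requests_count so the filters on both sides match (adjust the selectors to your actual labels):

(
  sum(rate(errorMetric{service_name="someservice"}[1h]))
    /
  sum(rate(http_requests_count{service_name="someservice"}[1h]))
) > 0.05

Since increase is just rate multiplied by the window length, the ratio and the 5% threshold are unchanged by switching to rate.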

YwH
  • Thanks, I always find that excerpt from the Prometheus docs a bit vague for my own understanding: ```increase should only be used with counters and native histograms where the components behave like counters```. What are the cases in which counters do not behave like counters, then? This is where I am confused. In my case it is a counter and indeed behaves as one, but I am not aware of a situation in which that does not apply. Can you help? – panza Nov 14 '22 at 19:04
  • In that respect, I think this question is a great explanation: https://stackoverflow.com/questions/54494394/do-i-understand-prometheuss-rate-vs-increase-functions-correctly – panza Nov 14 '22 at 23:18

It looks like you have a slow-changing integer counter, which may increase by less than 100 during an hour. Prometheus can return unexpected results from the increase() function when it is applied to slow-changing integer counters, because of the following issues:

  • increase(m[d]) may return fractional results over an integer counter m because of extrapolation. See this issue.
  • increase(m[d]) may miss the counter increase between the last raw sample just before the lookbehind window d and the first raw sample inside the lookbehind window d. See this article for more details.
  • increase(m[d]) may miss the initial counter increase if the m time series starts from a value other than zero.

The same issues apply to rate() as well, since increase() is syntactic sugar over rate() in Prometheus, i.e. increase(m[d]) = rate(m[d]) * d.
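
For example, with the request counter from the question, these two expressions return the same value for a 1h window (3600 seconds):

# counter increase over the last hour
increase(http_requests_count{service_name="someservice"}[1h])

# per-second rate over the last hour, multiplied by the window length in seconds
rate(http_requests_count{service_name="someservice"}[1h]) * 3600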

It is recommended to use longer lookbehind windows for the rate() and increase() functions when they are applied to slow-changing counters, in order to minimize the significance of the issues mentioned above. For example, use a 1h lookbehind window in square brackets instead of 5m, so the wider window catches non-zero counter increases.

As for the original query, it is better to rewrite it as follows:

(
  sum(increase(errorMetric{service_name="someservice"}[1h]))
    /
  sum(increase(http_requests_count{service_name="someservice"}[1h]))
) > 0.05

This query contains the following changes compared to the original query:

  • the 5m lookbehind window has been changed to 1h
  • the path label filter has been removed from the http_requests_count metric selector, so the number of errorMetric time series matches the number of http_requests_count time series. Alternatively, the path label filter could be added to the errorMetric metric selector instead, as in the sketch below.
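
For example, a sketch of that variant, assuming errorMetric is also labeled with path (if it is not, stick with the query above):

(
  sum(increase(errorMetric{service_name="someservice", path="/some/path"}[1h]))
    /
  sum(increase(http_requests_count{service_name="someservice", path="/some/path"}[1h]))
) > 0.05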
valyala