8

Say I have a metric request_failures for users. For each user I add a unique label value to the metric. So for user u1, when a request has failed twice, I get the following metric:

    request_failures{user_name="u1"} 2

I also have a rule that fires when there are new failures. Its expression is:

    increase(request_failures[1m]) > 0
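
For context, a minimal sketch of how this expression sits in my rule file (the group name, alert name, and annotation below are only placeholders, not my exact config):

    groups:
      - name: request-failures              # placeholder group name
        rules:
          - alert: RequestFailuresIncrease  # placeholder alert name
            expr: increase(request_failures[1m]) > 0
            annotations:
              summary: 'Request failures increased for user {{ $labels.user_name }}'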

This works well for a user that has already encountered failures. For example, when u1 encounters a third failure, the rule fires.

When a request fails for a new user u2, I get the following metrics:

    request_failures{user_name="u1"} 2
    request_failures{user_name="u2"} 1

Now the problem is that the alert rule doesn't fire for u2. It seems that the rule cannot recognize a "new" metric, even though the metrics are all named request_failures, just with different labels.

Can anyone point out how I should construct the rule?

Michael Doubez
  • 5,937
  • 25
  • 39
Jay Xue
  • 81
  • 1
  • 3
  • Do you mean that the metric exists only when a user has at least one failure, and your expression alerts only when the failure count increases, so you don't detect new failures? – Michael Doubez Sep 15 '20 at 11:31
  • When a new failure (for a new user) occurs, a new metric is created. So yes, the metric (with a specific label for the user) exists only when the user has at least one failure. And yes, my expression alerts only when the failure count increases from 1. The problem is that when the first failure occurs (the metric changes from 0 to 1), there is no alert. – Jay Xue Sep 15 '20 at 13:52
  • Sorry, but just to be precise: from Prometheus's point of view, do you create a new metric or a new label value? Your question could apply to either, and the answers are widely different. – Michael Doubez Sep 16 '20 at 06:46
  • My intention is to create the same metric (with the same name "request_failures") but with a new label value. It appears that from Prometheus's point of view, this is a "new metric". I posted an answer yesterday (see below), but my preference is to use a label instead of an annotation. So I'd appreciate it if you could suggest an approach that lets me keep using a label per user but still detect increases of the metric. – Jay Xue Sep 16 '20 at 13:11

3 Answers

4

As @MichaelDoubez already pointed out, increase() does not consider a newly created metric a value increase. Unfortunately, the same goes for changes(). There are reasons for that, such as a missing scrape, but it can still be solved with a query:

    increase(request_failures[10m]) > 0
    or
    ( request_failures unless request_failures offset 10m )

The second part (beginning with or) will fire for 10 minutes (defined by the offset) when there is a new metric.
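
Wrapped in an alerting rule, it could look roughly like this (the group name, alert name, and annotation are placeholders, just a sketch):

    groups:
      - name: request-failures
        rules:
          - alert: NewRequestFailures       # placeholder alert name
            expr: |
              increase(request_failures[10m]) > 0
              or
              (request_failures unless request_failures offset 10m)
            annotations:
              summary: 'New failures for user {{ $labels.user_name }}'
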

anemyte
  • 17,618
  • 1
  • 24
  • 45
2

The reason the rule doesn't fire is that the increase() function doesn't consider a newly created counter to have been 0 before the first scrape. I didn't find any source on that, but it seems to be the case.

Therefore you want to detect two cases:

  • if a user has an issue when they had none before
  • if a user has a new issue in the last N minutes

This can be rephrased in the opposite logic:

an alert should be triggered for a user with errors unless there was no increase in errors for this user in the last N minutes

Which readily translates into the following PromQL:

    request_failures > 0 unless increase(request_failures[1m]) == 0

In hindsight, regarding the increase() function: it cannot assume the previous value was 0, because it is evaluated over a range. The previous value may be outside the range and not equal to 0. So it makes sense to require at least two points to produce a value.
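
To illustrate how this behaves for a brand-new series (a sketch, assuming u2's counter has existed for less than 1m, i.e. it has only a single sample so far):

    # With a single sample in the 1m range, increase() cannot compute anything
    # and returns no result at all for u2:
    increase(request_failures{user_name="u2"}[1m])   # => empty

    # The "== 0" filter therefore matches nothing for u2, so the unless clause
    # does not suppress it and the left-hand side of the rule fires:
    request_failures{user_name="u2"} > 0             # => 1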

Michael Doubez
  • 5,937
  • 25
  • 39
  • This seems to be a very promising solution. I'll try it and get back to you. Thanks a lot! – Jay Xue Sep 22 '20 at 01:19
  • Hi Michael, thanks for your suggestion. Unfortunately it does not work either. I wonder if you have tried this in a live system? The expression `increase(request_failures[1m]) == 0` still doesn't work as expected for a "new" metric. In my testing the rule does not fire when the first error occurs, and only fires when the second error occurs. Further comments are appreciated. – Jay Xue Sep 24 '20 at 17:47
  • Hi Michael, any further suggestions/comments? – Jay Xue Sep 30 '20 at 16:53
  • Hello, yes I tested it. What do you mean it doesn't work? If your data points are too far from each other, the increase expression will only exist for a short time, which may not be enough for the alert to fire. Try increasing the 1m or adding an offset to it. You may have to tinker a bit with it depending on your polling rate and any `for` clause you have in your alert. – Michael Doubez Oct 01 '20 at 07:36
  • Hi Michael, by not working I mean the alert still isn't triggered when the error occurs for the first time (i.e., the metric goes from 0 to 1). In that case, in principle, the alert should always be triggered no matter how long I wait. Let me double-check whether changing that to 5m or 10m makes any difference. Thanks. – Jay Xue Oct 09 '20 at 13:20
  • @JayXue Did you find a solution? Did extending the time to 5m make any difference? I am facing the same issue as well. – kandarp Mar 08 '21 at 12:58
  • @MichaelDoubez do you have any suggestions? I am facing the same issue and still have no success. – kandarp Mar 09 '21 at 20:32
  • any solution to this? – Paul Praet Sep 16 '22 at 11:20
0

This should be the answer: https://www.robustperception.io/dont-put-the-value-in-alert-labels.

The key is that a label should not include variable values, since labels are part of a metric's identity. The solution is to add the username as an annotation instead of a metric label.
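
One way to read this (a sketch with placeholder alert and annotation names): the user name ends up in the notification through an annotation template on the alert, rather than being attached as an extra label on the alert itself:

    groups:
      - name: request-failures
        rules:
          - alert: RequestFailures          # placeholder alert name
            expr: increase(request_failures[1m]) > 0
            annotations:
              # the user is surfaced here instead of as an additional alert label
              description: 'Request failures increased for user {{ $labels.user_name }}'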

Jay Xue
  • 81
  • 1
  • 3