
I have a Grafana dashboard where I am trying to plot some Prometheus metrics.

I am trying to calculate the average response time for 2 endpoints using the formula:

http_request_duration_seconds_sum / http_request_duration_seconds_count

but when I plot the query in a Grafana graph panel, I get 4 curves (2 for each endpoint) instead of only 2, which I don't understand.

[Image: snippet from Grafana]

Can anyone tell me why there are 4 curves instead of 2? The two at the top come from the same query, and likewise for the two at the bottom.

UPDATE

Is adding

sum(rate(http_request_duration_sum[24h])) / sum(rate(http_request_duration_count[24h]))

the answer? That gives me 2 curves instead of 4, but I am not sure whether the result is what I am looking for (the average response time for each endpoint).

nelion

3 Answers


The http_request_duration_sum and http_request_duration_count metrics are counters, so they normally only increase over time and may reset to zero (for instance, when the service that exposes them is restarted):

  • The http_request_duration_sum metric shows the sum of all the request durations since the last service restart.
  • The http_request_duration_count metric shows the total number of requests since the last service restart.

So http_request_duration_sum / http_request_duration_count gives the average request duration since the service started. This value is rarely useful, because it smooths out spikes in request duration, and the smoothing gets stronger the longer the service runs. Usually people want to see the average request duration over the last N minutes. This can be calculated by wrapping each counter in the increase() function with the desired lookbehind window in square brackets. For example, the following query returns the average request duration over the last 5 minutes (see 5m in square brackets):

increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
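
To make the smoothing effect concrete, here is a minimal Python sketch with made-up counter samples. The numbers, the millisecond unit, and the helper names are assumptions for illustration only, and the increase() stand-in is simplified: it ignores counter resets and the extrapolation that Prometheus performs.

```python
# Hypothetical counter samples, one scrape per minute: (duration sum in ms, request count).
# Milliseconds are used only to keep the arithmetic exact; the real metric is in seconds.
samples = [(0, 0)]
for minute in range(1, 11):
    per_request_ms = 1000 if minute == 10 else 100  # latency spike in the last minute
    prev_sum, prev_count = samples[-1]
    samples.append((prev_sum + 100 * per_request_ms, prev_count + 100))


def lifetime_average(samples):
    """sum / count: average since the counters started at zero."""
    total_sum, total_count = samples[-1]
    return total_sum / total_count


def windowed_average(samples, window):
    """increase(sum[window]) / increase(count[window]), simplified:
    no counter-reset handling and no extrapolation."""
    old_sum, old_count = samples[-1 - window]
    new_sum, new_count = samples[-1]
    return (new_sum - old_sum) / (new_count - old_count)


print(lifetime_average(samples))     # 190.0
print(windowed_average(samples, 5))  # 280.0
print(windowed_average(samples, 1))  # 1000.0
```

In this toy data the last minute averages 1000 ms per request, but the lifetime ratio reports only 190 ms, and it would report even less after a longer uptime.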

The increase() query above may return multiple time series if the http_request_duration metric is exposed by multiple apps (aka jobs) or nodes (aka instances or scrape targets). If you need the average request duration over the last 5 minutes per job, then the sum function must be used:

sum(increase(http_request_duration_sum[5m])) by (job)
  /
sum(increase(http_request_duration_count[5m])) by (job)

Note that sum(...) by (job) is applied separately to the left and right sides of /. This isn't equivalent to the following incorrect queries:

sum(
  increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
) by (job)

avg(
  increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
) by (job)

The first incorrect query calculates the sum of the per-series average response times for each job, while the second calculates the average of those per-series averages. Neither is what most users expect; see this answer for details.
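
The difference between the three forms can be checked with a little arithmetic. The following Python sketch uses invented 5-minute increases for two instances of one job; the instance names, the numbers, and the millisecond unit are assumptions chosen to keep the arithmetic exact.

```python
# Hypothetical 5-minute increases for two instances of the same job:
# (increase of the _sum counter in ms, increase of the _count counter).
increases = {
    "instance-a": (90_000, 900),   # 900 requests at 100 ms each
    "instance-b": (100_000, 100),  # 100 requests at 1000 ms each
}

# Per-instance average durations, i.e. the result of dividing before aggregating.
per_instance = [s / c for s, c in increases.values()]

# sum(increase(sum)) / sum(increase(count)): the true average over all requests.
ratio_of_sums = sum(s for s, _ in increases.values()) / sum(c for _, c in increases.values())

# sum(increase(sum) / increase(count)): adds the per-instance averages together.
sum_of_ratios = sum(per_instance)

# avg(increase(sum) / increase(count)): weights both instances equally,
# regardless of how many requests each one served.
avg_of_ratios = sum(per_instance) / len(per_instance)

print(ratio_of_sums)  # 190.0
print(sum_of_ratios)  # 1100.0
print(avg_of_ratios)  # 550.0
```

Only the ratio of sums (190 ms) matches the real average over all 1000 requests, because the busy instance contributes in proportion to its traffic.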

valyala
  • Is there a list of all the http metrics and their definition? – Ayushmati Jul 11 '22 at 07:47
  • Every application exports its own set of metrics, so there is no single list of HTTP-related metrics with their definitions. Both Prometheus and Grafana support metric name auto-completion. For example, if you start typing `http_` in the query input field, you'll see a list of metrics starting with `http_`. As for descriptions, some applications provide them directly on the `/metrics` page that Prometheus scrapes. Otherwise the best approach is to search for the metric's description on Google. – valyala Jul 11 '22 at 12:51

I found out that the following query:

sum(rate(http_request_duration_sum[24h])) / sum(rate(http_request_duration_count[24h]))

is the answer I am looking for, giving me the average response time in seconds and only 1 curve per query.

Of course the lookbehind window should not be 24h, so I've set it to [1m] instead, following this SO answer.

nelion
  • sum(irate(http_request_duration_sum{service=~"$service"}[2m])) by (service) / sum(irate(http_request_duration_count{service=~"$service"}[2m])) by (service) – suiwenfeng Nov 06 '19 at 10:20
  • just paste one used in production – suiwenfeng Nov 06 '19 at 10:21
  • thanks suiwenfeng. Yup, it's the same, but using irate instead of rate. In my case I would probably sort by (instance), but it gives the same output nonetheless. – nelion Nov 06 '19 at 11:12
  • It is recommended to use `rate()` instead of `irate()` - see [this article](https://valyala.medium.com/why-irate-from-prometheus-doesnt-capture-spikes-45f9896d7832). – valyala Apr 13 '22 at 16:05
  • @suiwenfeng what does the 'by (service)' do in the query? – Ayushmati Jul 11 '22 at 09:01
  • @Ayushmati it means the aggregation is calculated per group, grouped by service. – suiwenfeng Jul 11 '22 at 10:04

Yes, those metrics coming from Prometheus are counters, so you should wrap them in rate() or irate(). Use irate() for volatile, fast-moving metrics.
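
As a rough illustration of how the two functions differ, the Python sketch below compares simplified stand-ins for them on made-up samples. The helper names and data are hypothetical, and both helpers ignore counter resets and the extrapolation Prometheus applies; the only point is that rate() averages over the whole window while irate() uses just the last two samples.

```python
# Hypothetical (timestamp in seconds, counter value) samples inside one 60 s window.
window = [(0, 0), (15, 30), (30, 60), (45, 90), (60, 300)]


def simple_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


print(simple_rate(window))   # 5.0
print(simple_irate(window))  # 14.0
```

With these numbers the burst in the final 15 seconds dominates the irate-style value, while the rate-style value spreads it over the full minute.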