The http_request_duration_sum and http_request_duration_count metrics are counters, so they normally increase over time and may occasionally reset to zero (for instance, when the service that exposes them is restarted):
- The http_request_duration_sum metric shows the sum of all request durations since the last service restart.
- The http_request_duration_count metric shows the total number of requests since the last service restart.
So http_request_duration_sum / http_request_duration_count gives the average request duration since the last service restart. This value is rarely useful, because it smooths out request duration spikes, and the smoothing gets stronger the longer the service runs. Usually people want to see the average request duration over the last N minutes. This can be calculated by wrapping each counter in the increase() function with the needed lookbehind window in square brackets. For example, the following query returns the average request duration over the last 5 minutes (see 5m in square brackets):
increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
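To see why the windowed ratio is more informative than the lifetime ratio, here is a minimal Python sketch. The sample values are made up for illustration, and the increase helper is a simplified stand-in that only subtracts the first sample in the window from the last one; the real increase() function also deals with counter resets and window boundaries.

```python
# Hypothetical counter samples scraped once per minute: (minute, value).
# The service handles 100 requests per minute at 0.1s each for 60 minutes,
# then every request takes 2s during the last 5 minutes.
duration_sum = []    # stands in for http_request_duration_sum
duration_count = []  # stands in for http_request_duration_count
total_seconds, total_requests = 0.0, 0
for minute in range(66):
    duration_sum.append((minute, total_seconds))
    duration_count.append((minute, total_requests))
    per_request = 0.1 if minute < 60 else 2.0
    total_seconds += 100 * per_request
    total_requests += 100


def increase(samples, window_minutes):
    """Simplified increase(): last value in the window minus the first one."""
    end_minute = samples[-1][0]
    in_window = [v for t, v in samples if t >= end_minute - window_minutes]
    return in_window[-1] - in_window[0]


# Raw counters divided by each other: average since the service start.
lifetime_avg = duration_sum[-1][1] / duration_count[-1][1]
# Increases over the last 5 minutes divided by each other.
windowed_avg = increase(duration_sum, 5) / increase(duration_count, 5)

print(f"lifetime average: {lifetime_avg:.3f}s")
print(f"5-minute average: {windowed_avg:.3f}s")
```

With these made-up numbers the lifetime average stays at about 0.246s, while the 5-minute average is 2.000s, so only the windowed ratio reveals the slowdown.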
This query may return multiple time series if the http_request_duration metric is exposed by multiple apps (aka jobs) or nodes (aka instances or scrape targets). If you need the average request duration over the last 5 minutes for each job, then both counters must be aggregated with the sum function:
sum(increase(http_request_duration_sum[5m])) by (job)
/
sum(increase(http_request_duration_count[5m])) by (job)
Note that sum(...) by (job) is applied separately to the left and the right side of /. This isn't equivalent to the following incorrect queries:
sum(
  increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
) by (job)

avg(
  increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
) by (job)
The first incorrect query calculates the sum of the per-series average request durations for each job, while the second incorrect query calculates the average of those per-series averages for each job. Neither is what most users expect, because an average of averages ignores how many requests each series served. See this answer for details.
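The difference between the correct query and the two incorrect ones can be illustrated with a small Python sketch. The two instances and their 5-minute increases are hypothetical numbers chosen for illustration, not output from a real system.

```python
# Hypothetical 5-minute increases for two instances of the same job:
# (increase of http_request_duration_sum, increase of http_request_duration_count)
increases = {
    "instance-a": (90.0, 900),   # 900 requests, 0.1s on average
    "instance-b": (200.0, 100),  # 100 requests, 2.0s on average
}

sum_seconds = sum(s for s, _ in increases.values())
sum_requests = sum(c for _, c in increases.values())
per_instance_avg = [s / c for s, c in increases.values()]

# sum(increase(..._sum)) / sum(increase(..._count)): the true per-job average.
ratio_of_sums = sum_seconds / sum_requests
# sum(increase(..._sum) / increase(..._count)): adds up per-instance averages.
sum_of_avgs = sum(per_instance_avg)
# avg(increase(..._sum) / increase(..._count)): weights each instance equally.
avg_of_avgs = sum_of_avgs / len(per_instance_avg)

print(f"ratio of sums:       {ratio_of_sums:.2f}s")
print(f"sum of averages:     {sum_of_avgs:.2f}s")
print(f"average of averages: {avg_of_avgs:.2f}s")
```

Here the job served 1000 requests in 290 seconds in total, so the true average is 0.29s. The sum of averages gives 2.10s and the average of averages gives 1.05s, because both treat the lightly loaded slow instance the same as the busy fast one.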