11

I have a question about calculating response times with Prometheus summary metrics.

I created a summary metric that is labeled not only with the service name but also with the complete path and the HTTP method.

Now I am trying to calculate the average response time for the complete service. I read the article about "rate then sum", and either I do not understand how the calculation is done or the calculation is, IMHO, not correct.

As far as I have read, this should be the correct way to calculate the response time per second:

sum by(service_id) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)

As I understand it, this computes the "duration per second" value (rate of sum / rate of count) for each subset and then sums those values per service_id.

This looks absolutely wrong to me, but perhaps it does not work the way I understand it.

Another way to get a similar-looking result is this:

sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)
  • But what is the difference?
  • What is really happening here?
  • And why do I honestly only get measurable values if I use "max" instead of "sum"?

If I ignored everything I have read, I would try it the following way:

rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
/
rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])

But this will not work at all... (instant vector vs range vector and so on...).

halfer
eventhorizon

3 Answers

11

All of these examples are aggregating incorrectly, as you're averaging an average. You want:

  sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
  )
/
  sum without (path,host) (
    rate(request_duration_count{status_code=~"2.*"}[5m])
  )

Which will return the average latency per status_code plus any other remaining labels.
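To see why the order of division and aggregation matters, here is a minimal arithmetic sketch in Python rather than PromQL. The two paths and their per-second rates are made-up numbers, not values from the question. It compares summing the per-path averages, taking their maximum, and dividing the summed rates:

```python
# Hypothetical per-path rates over a 5m window for one service.
# Each entry: (rate of request_duration_sum, rate of request_duration_count),
# i.e. seconds of latency accumulated per second, and requests per second.
rates = {
    "/fast": (5.0, 100.0),  # 100 req/s at 0.05 s each
    "/slow": (2.0, 1.0),    # 1 req/s at 2 s each
}

# sum(rate(sum) / rate(count)): adds up the per-path averages.
sum_of_averages = sum(s / c for s, c in rates.values())

# max(rate(sum) / rate(count)): the average of the slowest path.
max_of_averages = max(s / c for s, c in rates.values())

# sum(rate(sum)) / sum(rate(count)): total time divided by total requests.
true_average = sum(s for s, _ in rates.values()) / sum(c for _, c in rates.values())

print(f"sum of averages: {sum_of_averages:.4f} s")
print(f"max of averages: {max_of_averages:.4f} s")
print(f"true average:    {true_average:.4f} s")
```

With these assumed numbers the sum of averages (2.05 s) grows with every additional path and ignores how much traffic each path receives, while the ratio of sums (about 0.069 s) weights each path by its request rate. The maximum (2.0 s) is at least the real average of one path, which may be why "max" looked more plausible than "sum" in the question.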

brian-brazil
  • 4
    I think this is right - because you wrote it. But I would like to understand what the given queries really do. What material should I study? Online courses, a bible...? ;-) – eventhorizon Jun 27 '18 at 17:46
  • 2
    Try the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/). – Alin Sînpălean Jun 28 '18 at 10:12
5
  • The by modifier groups aggregate function results by labels enumerated inside by(...).
  • The without modifier groups aggregate function results by all the labels except those enumerated inside without(...).

For example, suppose process_resident_memory_bytes metric exists with job, instance and datacenter labels:

process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc1"} N1
process_resident_memory_bytes{job="job1",instance="host2",datacenter="dc1"} N2
process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc2"} N3
process_resident_memory_bytes{job="job2",instance="host1",datacenter="dc1"} N4

Then sum(process_resident_memory_bytes) by (datacenter) would return the total memory usage per datacenter, while sum(process_resident_memory_bytes) without (instance) would return the total memory usage per job and datacenter.
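The grouping rule can be mimicked in a few lines of Python. This is only a sketch of the label arithmetic, not of how Prometheus implements it. The helper names sum_by and sum_without are hypothetical, and the values 10, 20, 30 and 40 stand in for N1 to N4:

```python
from collections import defaultdict

# The four sample series from above; the values are placeholders for N1..N4.
series = [
    ({"job": "job1", "instance": "host1", "datacenter": "dc1"}, 10),
    ({"job": "job1", "instance": "host2", "datacenter": "dc1"}, 20),
    ({"job": "job1", "instance": "host1", "datacenter": "dc2"}, 30),
    ({"job": "job2", "instance": "host1", "datacenter": "dc1"}, 40),
]

def sum_by(series, keep):
    """Mimic sum by (...): group on the listed labels only."""
    groups = defaultdict(int)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        groups[key] += value
    return dict(groups)

def sum_without(series, drop):
    """Mimic sum without (...): group on every label except the listed ones."""
    groups = defaultdict(int)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)

by_dc = sum_by(series, {"datacenter"})
without_instance = sum_without(series, {"instance"})

for key, value in by_dc.items():
    print("by (datacenter):", dict(key), value)
for key, value in without_instance.items():
    print("without (instance):", dict(key), value)
```

Grouping by datacenter leaves two result series (dc1 and dc2), while dropping only instance leaves three, one for each remaining combination of job and datacenter.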

valyala
0

Using Prometheus metrics in Grafana, the without keyword did not work for me (at least not as I expected it to). I got satisfactory results with by:

  sum by (status_code)(
    rate(request_duration_sum{status_code=~"2.*"}[5m])
  )
/
  sum by (status_code)(
    rate(request_duration_count{status_code=~"2.*"}[5m])
  )
Nicolas Gaborel