I have a question about calculating response times with Prometheus summary metrics.
I created a summary metric that is labelled not only with the service name but also with the complete path and the HTTP method.
Now I am trying to calculate the average response time for the whole service. I read the article about "rate then sum", and either I do not understand how the calculation is done, or the calculation is IMHO not correct.
As far as I have read, this should be the correct way to calculate the average response time over the last 5 minutes:
sum by(service_id) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
/
rate(request_duration_count{status_code=~"2.*"}[5m])
)
As I understand it, this computes the average duration (rate of sum / rate of count) for each label combination and then adds those averages up per service_id.
This looks absolutely wrong to me, but perhaps it does not work the way I understand it.
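To make my concern concrete, here is a small Python sketch of the arithmetic as I understand it. The paths and rate values are made up purely for illustration (they are not real measurements), and I am assuming the durations are recorded in seconds.

```python
# Hypothetical per-second rates for a single service_id, one entry per path.
# Each tuple is (rate of request_duration_sum, rate of request_duration_count).
rates = {
    "/a": (2.0, 10.0),
    "/b": (0.5, 1.0),
    "/c": (3.0, 30.0),
}

# Average duration per request for each series (rate of sum / rate of count).
per_series_avg = {path: s / c for path, (s, c) in rates.items()}

# What I think sum by(service_id) (rate(sum) / rate(count)) produces:
# the per-path averages added together.
sum_of_averages = sum(per_series_avg.values())

# What using max instead of sum would produce: the slowest path's average.
max_of_averages = max(per_series_avg.values())

# What I actually want: total duration divided by total request count.
total_duration_rate = sum(s for s, _ in rates.values())
total_count_rate = sum(c for _, c in rates.values())
overall_avg = total_duration_rate / total_count_rate

print(per_series_avg)
print(sum_of_averages, max_of_averages, overall_avg)
```

With these made-up numbers the sum of the averages comes out at about 0.8 even though no single path averages more than 0.5, while total duration divided by total count gives about 0.13. That is why the first query looks wrong to me.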
Another way to get a similar-looking result is this:
sum without (path,host) (
rate(request_duration_sum{status_code=~"2.*"}[5m])
/
rate(request_duration_count{status_code=~"2.*"}[5m])
)
- But what is the difference?
- What is really happening here?
- And why, honestly, do I only get plausible values if I use "max" instead of "sum"?
If I ignored everything I have read, I would try it the following way:
rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
/
rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])
But this does not work at all (rate() expects a range vector, while sum returns an instant vector, and so on).