Difference between PromQL "by" and "without" unclear

Question

I have a question about calculating response times with Prometheus summary metrics.

I created a summary metric that carries not only the service name as a label but also the full path and the HTTP method.

Now I am trying to calculate the average response time for the whole service. I read the article about "rate then sum", and either I do not understand how the calculation is done or the calculation is, IMHO, not correct.

As far as I read this should be the correct way to calculate the response time per second:

sum by(service_id) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)

My understanding is that this computes the "duration per second" value (rate of sum / rate of count) for each label combination and then sums those values per service_id.

This looks absolutely wrong to me, but I suspect it does not work the way I understand it.

Another way to get a similar-looking result is this:

sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
    /
    rate(request_duration_count{status_code=~"2.*"}[5m])
)

But what is the difference?
What is really happening here?
And why do I honestly only get measurable values if I use "max" instead of "sum"?

If I ignored everything I have read, I would try it the following way:

rate(sum by(service_id) request_duration_sum{status_code=~"2.*"}[5m])
/
rate(sum by(service_id) request_duration_count{status_code=~"2.*"}[5m])

But this does not work at all (instant vector vs. range vector and so on).

score 11 · Accepted Answer · answered Jun 27 '18 at 14:22

All of these examples aggregate incorrectly, as you're aggregating averages. Aggregate the numerator and the denominator separately, then divide. You want:

  sum without (path,host) (
    rate(request_duration_sum{status_code=~"2.*"}[5m])
  )
/
  sum without (path,host) (
    rate(request_duration_count{status_code=~"2.*"}[5m])
  )

This returns the average latency per status_code and per any other labels that remain.
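To see why the order of operations matters, here is a minimal sketch in Python rather than PromQL. The two paths and their rates are invented for illustration: each tuple stands in for what rate(request_duration_sum[5m]) and rate(request_duration_count[5m]) might return for one path of a single service.

```python
# Hypothetical per-path rates for one service (values invented for illustration).
# Each entry: (rate of request_duration_sum, rate of request_duration_count)
per_path_rates = {
    "/fast": (10.0, 100.0),  # 100 req/s at 0.1 s each
    "/slow": (5.0, 1.0),     # 1 req/s at 5.0 s each
}

def sum_of_ratios(rates):
    """Mimics sum(rate(sum) / rate(count)): divide per path, then add up."""
    return sum(duration / count for duration, count in rates.values())

def ratio_of_sums(rates):
    """Mimics sum(rate(sum)) / sum(rate(count)): add up first, then divide."""
    total_duration = sum(duration for duration, _ in rates.values())
    total_count = sum(count for _, count in rates.values())
    return total_duration / total_count

wrong = sum_of_ratios(per_path_rates)
right = ratio_of_sums(per_path_rates)
print(f"sum of per-path averages: {wrong:.3f} s")
print(f"overall average latency:  {right:.3f} s")
```

With these numbers the first approach adds the two per-path averages (0.1 s and 5.0 s), while the second weights each path by its request rate, so the single slow request per second barely moves the overall average.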

answered Jun 27 '18 at 14:22

brian-brazil


I think this is right, because you wrote it. But I would like to understand what these queries actually do. What should I study? Online courses, the Bible...? ;-) – eventhorizon Jun 27 '18 at 17:46

Try the [Prometheus documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/). – Alin Sînpălean Jun 28 '18 at 10:12

score 5 · Answer 2 · answered Apr 06 '22 at 17:11

The by modifier groups the results of an aggregate function by the labels enumerated inside by(...).
The without modifier groups the results of an aggregate function by all labels except those enumerated inside without(...).

For example, suppose process_resident_memory_bytes metric exists with job, instance and datacenter labels:

process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc1"} N1
process_resident_memory_bytes{job="job1",instance="host2",datacenter="dc1"} N2
process_resident_memory_bytes{job="job1",instance="host1",datacenter="dc2"} N3
process_resident_memory_bytes{job="job2",instance="host1",datacenter="dc1"} N4

Then sum(process_resident_memory_bytes) by (datacenter) returns the total memory usage per datacenter (N1+N2+N4 for dc1 and N3 for dc2), while sum(process_resident_memory_bytes) without (instance) returns the total memory usage per job and datacenter (N1+N2 for job1 in dc1, N3 for job1 in dc2, and N4 for job2 in dc1).
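The grouping rule can be sketched in Python. This is only an illustration of the semantics, not how Prometheus implements it; the values 100 to 400 are invented stand-ins for N1 to N4, and the helper names sum_by and sum_without are hypothetical.

```python
from collections import defaultdict

# The four series from the example, with invented values standing in for N1..N4.
samples = [
    ({"job": "job1", "instance": "host1", "datacenter": "dc1"}, 100),
    ({"job": "job1", "instance": "host2", "datacenter": "dc1"}, 200),
    ({"job": "job1", "instance": "host1", "datacenter": "dc2"}, 300),
    ({"job": "job2", "instance": "host1", "datacenter": "dc1"}, 400),
]

def _aggregate(samples, keep_label):
    """Sum values of all series that share the same kept labels."""
    groups = defaultdict(int)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if keep_label(k)))
        groups[key] += value
    return dict(groups)

def sum_by(samples, labels):
    """Like `sum by (...)`: keep only the listed labels."""
    return _aggregate(samples, lambda name: name in labels)

def sum_without(samples, labels):
    """Like `sum without (...)`: keep every label except the listed ones."""
    return _aggregate(samples, lambda name: name not in labels)

by_dc = sum_by(samples, {"datacenter"})
without_instance = sum_without(samples, {"instance"})

for key, total in by_dc.items():
    print("by (datacenter):     ", dict(key), total)
for key, total in without_instance.items():
    print("without (instance):  ", dict(key), total)
```

The two modifiers differ only in how the grouping key is chosen: by names the labels to keep, without names the labels to drop. For this data set, by (job, datacenter) and without (instance) would produce the same groups, but they diverge as soon as a series gains an additional label.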

score 0 · Answer 3 · answered Dec 02 '19 at 13:56

Using Prometheus metrics in Grafana, the without keyword did not work for me (at least not as I expected it to). I got satisfying results with by:

  sum by (status_code)(
    rate(request_duration_sum{status_code=~"2.*"}[5m])
  )
/
  sum by (status_code)(
    rate(request_duration_count{status_code=~"2.*"}[5m])
  )
