
The query that I am using to grab the 99th percentile of API request latency is:

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime"}[1m])) by (handler, method, le))

My latency histogram buckets are defined as [0.05, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0] in my code, which matches what I get when hitting the metrics endpoint for a sample API endpoint (i.e. the TestController.java class and its testLatencyTime() method); a sketch of how such a histogram might be declared follows the samples:

http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="0.05",} 
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="0.25",} 
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="0.5",} 
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="1.0",} 
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="2.0",} 
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="4.0",}
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="8.0",} 
http_request_duration_seconds_bucket{method="POST",handler="TestController.testLatencyTime",code="200",le="+Inf",} 
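For context, this is roughly how a histogram with those buckets might be declared with the Prometheus Java simpleclient; the exact declaration in my code differs, and the label names here are just read off the samples above:

import io.prometheus.client.Histogram;

// Hypothetical declaration matching the buckets and labels shown above.
static final Histogram REQUEST_LATENCY = Histogram.build()
        .name("http_request_duration_seconds")
        .help("HTTP request latency in seconds")
        .labelNames("method", "handler", "code")
        .buckets(0.05, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0)
        .register();

// Each request then records its duration in seconds, e.g.:
// REQUEST_LATENCY.labels("POST", "TestController.testLatencyTime", "200").observe(durationSeconds);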

So it's the http_request_duration_seconds_bucket metric, passed to the rate function. Per the Stack Overflow post referenced below, rate applied to the buckets should behave like a cumulative distribution: "rate applied on buckets calculates a set of rate of increments that happened on all the buckets in the span of the last 1 minute. So, to answer your question, it is a cumulative density distribution on the rate of changes calculated in a given time frame". Per this YouTube video, is it correct to assume the cumulative distribution function is the area under the curve to the LEFT of a point of interest? https://www.youtube.com/watch?v=3xAIWiTJCvE (I've sketched my understanding of this step right after the reference below.)

reference:

what's the math behind histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])) in PromQL
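
To make that part concrete to myself, here is a small sketch (my own illustration, not Prometheus internals) of what I understand rate() to do with the bucket counters: it turns two cumulative counter snapshots into per-second increase rates, one per le bucket, and because the counters are cumulative in le, the resulting values are still non-decreasing in le, i.e. an unnormalized CDF of the requests seen in the window.

// Illustration only: per-bucket rate over a window, given cumulative counter
// values at the start and end of the window (one entry per le bucket,
// ascending, ending with +Inf).
static double[] bucketRates(double[] countsStart, double[] countsEnd, double windowSeconds) {
    double[] rates = new double[countsEnd.length];
    for (int i = 0; i < rates.length; i++) {
        // Increase of the le=<bound> counter during the window, per second.
        // Each entry is the rate of requests that took <= that bound, so the
        // array is non-decreasing: a cumulative distribution over the window.
        rates[i] = (countsEnd[i] - countsStart[i]) / windowSeconds;
    }
    return rates;
}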

Furthermore, this is passed to the sum function, where the values returned by the rate function are summed/aggregated by (handler, method, le). I'm also trying to understand this sentence from the Prometheus documentation: "The quantile is calculated for each label combination in http_request_duration_seconds" (my reading of it is sketched after the reference below).

reference:

https://prometheus.io/docs/prometheus/latest/querying/functions/
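
My reading of that sentence is that, after sum(...) by (handler, method, le), histogram_quantile() runs the same per-series calculation independently for every remaining (handler, method) label combination. Below is my own rough approximation of that per-series calculation (not the actual Prometheus source): it finds the bucket the requested quantile falls into and linearly interpolates between that bucket's lower and upper bound.

// Rough approximation of histogram_quantile() for one label combination.
// upperBounds are the le values in ascending order (ending in +Inf);
// cumulativeRates are the per-bucket rates from the sketch above.
static double histogramQuantile(double q, double[] upperBounds, double[] cumulativeRates) {
    double total = cumulativeRates[cumulativeRates.length - 1]; // +Inf bucket counts every request
    double rank = q * total;                                    // how far into the distribution to go
    for (int i = 0; i < cumulativeRates.length; i++) {
        if (cumulativeRates[i] >= rank) {
            double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
            double upper = upperBounds[i];
            if (Double.isInfinite(upper)) {
                return lower; // quantile falls in the +Inf bucket: capped at the highest finite bound
            }
            double below = (i == 0) ? 0.0 : cumulativeRates[i - 1];
            // Linear interpolation inside the bucket the quantile lands in.
            return lower + (upper - lower) * (rank - below) / (cumulativeRates[i] - below);
        }
    }
    return Double.NaN; // no data in the window
}

If that approximation is right, the interpolation assumes observations are spread evenly inside each bucket, which is what I suspect is behind the behaviour described next.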

My problem is that when I use this dummy REST controller (Spring Boot, the TestController.java class and its testLatencyTime() method) to visualize the data in a locally running Prometheus/Grafana instance using Docker, and I make a "dummy" request with Thread.sleep(4000) and watch how Grafana plots it, the result doesn't make sense to me:

import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Dummy endpoint: sleeps for {wait} milliseconds so I can produce requests with a known latency.
@RestController
@RequestMapping("/test")
public class TestController {

    @PostMapping("/latency/{wait}")
    public ResponseEntity<String> testLatencyTime(@PathVariable Long wait) throws InterruptedException {
        Thread.sleep(wait);
        return new ResponseEntity<>("Request completed!", HttpStatus.OK);
    }
}

For example, the above 4000 ms sleep gets marked as an "8 second" spike in Grafana for the 99th percentile query I posted above. Also, if I make a mock API call take 3 seconds, it gets marked as 4 seconds. It's almost as if Grafana is marking the graph at the UPPER BOUND of the bucket the request falls into! (i.e. the upper bound for a 2 s or 3 s API call would be 4; the upper bound for a 4 s call is 8.) Is my understanding of the statistical mathematics behind how Grafana graphs this incorrect, is it being plotted incorrectly, or is my query wrong/inaccurate? Please help! I can add any more information that is requested, but I think I included a good amount.
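
To check my suspicion, I plugged my scenario into the approximation above (my own arithmetic, so it may be wrong). A single Thread.sleep(4000) request actually takes slightly over 4.0 s, so in the rate window it only shows up in the le="8.0" and le="+Inf" buckets, and the 0.99 quantile then interpolates almost all the way to that bucket's upper bound:

// Hypothetical window containing exactly one request of ~4.05 s.
// rate() would scale every entry by the same factor (1/window), which does
// not change the quantile, so plain counts are used here for readability.
double[] upperBounds = {0.05, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, Double.POSITIVE_INFINITY};
double[] cumulativeRates = {0, 0, 0, 0, 0, 0, 1, 1};

// Interpolates inside the (4.0, 8.0] bucket:
// 4.0 + (8.0 - 4.0) * 0.99 = 7.96, which Grafana plots as roughly "8 seconds".
double p99 = histogramQuantile(0.99, upperBounds, cumulativeRates);

// Likewise a ~3 s request lands in the (2.0, 4.0] bucket:
// 2.0 + (4.0 - 2.0) * 0.99 = 3.98, plotted as roughly 4 seconds.

If that reading is right, the spikes at 8 s and 4 s are just the interpolation hitting the upper bound of a bucket that contains only one request in the window, but I would like someone to confirm whether that is actually how histogram_quantile() behaves.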

    The inaccuracy of the `histogram_quantile()` is mostly related to bucket bounds. It can be reduced by increasing the number of buckets and reducing their sizes. There are alternative histogram implementations, which consistently reduce the calculation error and don't require manual configuration for the buckets - https://valyala.medium.com/improving-histogram-usability-for-prometheus-and-grafana-bc7e5df0e350 – valyala Jun 27 '21 at 09:39

0 Answers