Your error budget is defined over a rolling window, same as your SLO.
An SLO (service level objective) consists of an SLI (service level indicator), a threshold (yours is 95%), and an evaluation period (a trailing 28 days is common; yours was stated as 24 hours). Your SLI is your 'definition of good', which you stated as 'percent of sample requests receiving responses in < 450ms'. Since your SLO threshold is 95%, the error budget for this service is 5% (1 − 0.95), over the same trailing evaluation period.
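As a minimal sketch of those pieces, using the figures from your question (the names and constants below are illustrative, not from any particular monitoring library):

```python
SLO_THRESHOLD = 0.95              # 95% of sampled requests must be 'good'
WINDOW_HOURS = 24                 # trailing evaluation period, as you stated

ERROR_BUDGET = 1 - SLO_THRESHOLD  # 0.05 -> 5% of samples may be bad

def is_good(latency_ms: float) -> bool:
    """The SLI's 'definition of good': a response in under 450 ms."""
    return latency_ms < 450.0
```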
In a request-based system you'll get good signal from dividing successful sample counts by total sample counts; likewise, when observing the service's responses, from dividing successful response counts by total valid response counts. Either provides a statistically sound volume-weighted ratio. And when we say 'counts', it means quite literally the counts observed in the evaluation period, in the rolling window. As samples age out of that window, the error budget is, as you say, replenished.
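Here's a sketch of that rolling, volume-weighted calculation, assuming a simple in-process counter (a real system would query a TSDB instead; `RollingSLI` is a hypothetical name):

```python
from collections import deque
import time

class RollingSLI:
    """Volume-weighted SLI over a rolling window: good count / total count."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, good) pairs

    def record(self, good, now=None):
        self.samples.append((time.time() if now is None else now, good))

    def sli(self, now=None):
        now = time.time() if now is None else now
        # Evict samples older than the window: this aging-out is exactly
        # how the error budget gets 'replenished'.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
        total = len(self.samples)
        good = sum(1 for _, g in self.samples if g)
        return good / total if total else 1.0
```

Budget remaining is then just `ERROR_BUDGET - (1 - sli())` for the same window.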
It's the time-weighted calculation that seems to be the issue here: counting "a good minute" is effective when minutes are comparable. But requests are bursty, and time-weighting gives the previous 99 minutes with 1 successful request each (no failures) the same contribution as the current minute with 1000 failures (no successes), as an example. It's the difference between a time-weighted 99% SLI for those 100 minutes and a volume-weighted 9% for the experience your customers are actually having.
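You can verify that gap directly (illustrative numbers; a minute here counts as 'good' if its own ratio meets the 95% target, which is one common convention):

```python
# 99 minutes with one success each (no failures), then one minute with
# 1000 failures (no successes).
minutes = [(1, 0)] * 99 + [(0, 1000)]  # (good, bad) counts per minute

# Time-weighted: classify each minute, then weight every minute equally.
good_minutes = sum(1 for g, b in minutes if g / (g + b) >= 0.95)
time_weighted = good_minutes / len(minutes)       # 99/100 = 99%

# Volume-weighted: pool every request in the window.
good = sum(g for g, _ in minutes)
total = sum(g + b for g, b in minutes)
volume_weighted = good / total                    # 99/1099 ≈ 9.0%

print(f"time-weighted: {time_weighted:.0%}  volume-weighted: {volume_weighted:.1%}")
```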