Your error budget is defined over a rolling window, same as your SLO.
An SLO (service level objective) consists of an SLI (service level indicator), a threshold (yours is 95%), and an evaluation period (a trailing 28 days is common; yours was stated as 24 hours). Your SLI is your 'definition of good', which you stated as 'percent of sample requests receiving responses in < 450ms'. Since your SLO threshold is 95%, the error budget for this service is 5% (1 − 0.95), over the same trailing evaluation period.
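As a minimal sketch of those pieces, using the figures from your question (the names and constants below are illustrative, not from any particular monitoring library):

```python
SLO_THRESHOLD = 0.95              # 95% of sampled requests must be 'good'
WINDOW_HOURS = 24                 # trailing evaluation period, as you stated

ERROR_BUDGET = 1 - SLO_THRESHOLD  # 0.05 -> 5% of samples may be bad

def is_good(latency_ms: float) -> bool:
    """The SLI's 'definition of good': a response in under 450 ms."""
    return latency_ms < 450.0
```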
In a request-based system you'll get good signal from dividing successful sample counts by total sample counts; likewise, when observing the service's responses, from dividing successful response counts by total valid response counts. Either provides a statistically sound volume-weighted ratio. And when we say 'counts', it means quite literally the counts observed in the evaluation period, in the rolling window. As samples age out of that window, the error budget is, as you say, replenished.
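Here's a sketch of that rolling, volume-weighted calculation, assuming a simple in-process counter (a real system would query a TSDB instead; `RollingSLI` is a hypothetical name):

```python
from collections import deque
import time

class RollingSLI:
    """Volume-weighted SLI over a rolling window: good count / total count."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, good) pairs

    def record(self, good, now=None):
        self.samples.append((time.time() if now is None else now, good))

    def sli(self, now=None):
        now = time.time() if now is None else now
        # Evict samples older than the window: this aging-out is exactly
        # how the error budget gets 'replenished'.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
        total = len(self.samples)
        good = sum(1 for _, g in self.samples if g)
        return good / total if total else 1.0
```

Budget remaining is then just `ERROR_BUDGET - (1 - sli())` for the same window.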
It's the time-weighted calculation that seems to be the issue here: counting "a good minute" is effective when minutes are comparable. But requests are bursty, and time-weighting gives the previous 99 minutes with 1 successful request each (no failures) the same contribution as the current minute with 1000 failures (no successes), as an example. It's the difference between a time-weighted 99% SLI for those 100 minutes and a volume-weighted 9% for the experience your customers are actually having.
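You can verify that gap directly (illustrative numbers; a minute here counts as 'good' if its own ratio meets the 95% target, which is one common convention):

```python
# 99 minutes with one success each (no failures), then one minute with
# 1000 failures (no successes).
minutes = [(1, 0)] * 99 + [(0, 1000)]  # (good, bad) counts per minute

# Time-weighted: classify each minute, then weight every minute equally.
good_minutes = sum(1 for g, b in minutes if g / (g + b) >= 0.95)
time_weighted = good_minutes / len(minutes)       # 99/100 = 99%

# Volume-weighted: pool every request in the window.
good = sum(g for g, _ in minutes)
total = sum(g + b for g, b in minutes)
volume_weighted = good / total                    # 99/1099 ≈ 9.0%

print(f"time-weighted: {time_weighted:.0%}  volume-weighted: {volume_weighted:.1%}")
```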