I have an alarm tracking the metric for LoadBalancer 5xx errors n a single ALB. This should be in an "In alarm" state if 1 datapoint in the past 1 is above the threshold of 2. The period is set to 1 minute. See the alarm details:
On 2020-09-23 at 17:18 UTC the Load Balancer started to return 502 errors. This is shown in the Cloudwatch metric chart below, and I've confirmed the times are correct (this was a forced 502 response so I know when I triggered it and I can see the 17:18 timestamp in the ALB logs)
But in the alarm log, the "In Alarm" state was only triggered at 17:22 UTC - 4 minutes after the 17:18 period had more than 2 errors. This isn't a delay in receiving a notification - it's about a delay in the state change compared to my expectation. Notifications were correctly received within seconds of the state change.
Here is the Alarm log with the state change timestamps:
We consider missing data as GOOD, so based on the metric graph, I assume it should have recovered to OK at 17:22 (after the 17:21 period with 0 errors) but only returned to OK at 17:27 - 5minutes delay.
I then expected it to return to "In alarm" at 17:24, but this didn't return until 17:28.
Finally, I expect it to have returned to OK at 17:31 but it took until 17:40 - a full 9 minutes afterwards.
Why is there a 4-9 minute delay between when I expect a state transition and it actually happening?