In GCP, I have an alerting policy on database CPU and memory usage. For example, if CPU is over 50% over a 1m period, the alert fires.
The alert is kind of noisy. With other systems, I've been able to alert only if the threshold is violated multiple times, e.g.
- If the threshold is violated for 2 consecutive minutes.
- If over a 5 minute period, the threshold is violated in 3 of those minutes.
(Note: I don't want to simply change my alignment period to 2 minutes.)
There are a couple things I've seen in the GCP alert configuration that might help here:
- Change the "trigger"
- UI: "Alert trigger: Number of time series violates", "Minimum number of time series in violation: 2".
- JSON:
"trigger": {"count": 2}
- Change the "retest window"
- UI: "Advanced Options" → "Retest window: 2m"
- JSON:
"duration": "120s"
But I can't figure out exactly how these work. Can these be used to achieve the goal?