11

I have faced some issues with the Prometheus memory alert. When I take a backup of GitLab, memory usage goes up to 95%. I want to snooze the memory alert for a specific time.

e.g. if I am taking a backup at 2 AM, I need to snooze the Prometheus memory alert. Is this possible?

James Z
Abhijit

7 Answers

16

As Marcelo said, there is no way to schedule a silence, but if the backup runs at a regular interval (say, every night from 2 am to 3 am), you can encode that in the alert expression.

- alert: OutOfMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 and on() absent(hour() >= 2 <= 3)

This can rapidly become tedious if you want to silence many rules (or if you want more complex schedules of inhibition). In that case, you can use inhibition rules of alert manager in the following way.

First step is to define an alert, in Prometheus, fired at the time you want the inhibition to take place:

- alert: BackupHours
  expr: hour() >= 2 <= 3
  for: 1m
  labels:
    notification: none
  annotations:
    description: 'This alert fires during backup hours to inhibit others'

Remember to add a route in alert manager to avoid notifying this alert:

routes:
  - match:
      notification: none
    receiver: do_nothing
receivers:
- name: do_nothing

And then use inhibition rules to silence target rules during that time:

inhibit_rules:
- source_match:
    alertname: BackupHours
  target_match:
    # here can be any other selection of alert
    alertname: OutOfMemory

Note that this only works out of the box in UTC. If you need DST-aware local time, it requires more boilerplate (with recording rules, for example).
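As a sketch of that boilerplate: a recording rule can shift hour() by your UTC offset. The offset here (UTC+2) and the rule name are assumptions; there is no automatic DST handling, so the offset must be updated (or the rule regenerated) when the clocks change:

groups:
  - name: local-time
    rules:
      # Hypothetical recording rule: local hour, assuming a fixed UTC+2 offset.
      - record: local_time:hour
        expr: (hour() + 2) % 24

The BackupHours alert expression can then use local_time:hour >= 2 <= 3 instead of calling hour() directly.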

As a side note, if you are monitoring your backup process, you may already have a metric that indicates a backup is under way. If so, you could use that metric to inhibit the other alerts, and you wouldn't need to maintain a schedule.
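For instance, assuming your backup job exports a gauge while it runs (the metric name backup_in_progress below is hypothetical), the source alert can be driven by the metric instead of the clock:

- alert: BackupInProgress
  # backup_in_progress is a hypothetical gauge set to 1 by the backup job
  expr: backup_in_progress == 1
  labels:
    notification: none
  annotations:
    description: 'Fires while a backup runs, to inhibit other alerts'

The inhibit rule then matches alertname: BackupInProgress as its source instead of BackupHours.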

Michael Doubez
1

No, it's not possible to have scheduled silences.

Some workarounds for your case:

1) Maybe you can change your Prometheus configuration and increase the "for" clause to give the backup more time to run without triggering the alert.

2) You can use the REST API to create/delete silences at the beginning/ending of the backup.

See more info about this subject here.
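As a sketch of option 2, assuming Alertmanager listens on localhost:9093 (its default), the v2 API accepts a silence as JSON. A cron job could run something like this just before the backup starts (timestamp generation uses GNU date):

```shell
#!/bin/sh
# Build a one-hour silence for the OutOfMemory alert.
START=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
END=$(date -u -d "+1 hour" +"%Y-%m-%dT%H:%M:%SZ")
cat > silence.json <<EOF
{
  "matchers": [{"name": "alertname", "value": "OutOfMemory", "isRegex": false}],
  "startsAt": "$START",
  "endsAt": "$END",
  "createdBy": "backup-cron",
  "comment": "GitLab backup window"
}
EOF
# Submit it; the response contains the silence ID, which you can keep and
# delete via /api/v2/silence/<id> once the backup finishes:
# curl -s -X POST -H 'Content-Type: application/json' \
#   -d @silence.json http://localhost:9093/api/v2/silences
```

Deleting the silence at the end of the backup (rather than relying on endsAt) restores alerting as soon as the backup actually finishes.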

1

You can compare the current value against history, so the alert won't fire unless the metric is more than 2x its value at the same time on each of the past two days.

- alert: CPULoadAlert
  # Condition for alerting
  expr: >-
    node_load5 / node_load5 offset 1d > 2 and
    node_load5 / node_load5 offset 2d > 2 and
    node_load5 > 1
  for: 5m
  # Annotations - additional informational labels to store more information
  annotations:
    summary: 'Instance {{ $labels.instance }} got an unusually high CPU load'
    description: '{{ $labels.instance }} of job {{ $labels.job }} got a CPU spike over 2x compared to the previous 2 days.'
  # Labels - additional labels to be attached to the alert
  labels:
    severity: 'warning'
AYB
1

You can take a different approach and mute the notifications instead of snoozing the alerts. This solution may work for you if you don't want to receive notifications but you're OK with the alert still firing.

Time intervals will effectively do this.

Add the following to your alertmanager.yml:

time_intervals:
  - name: backuphours
    time_intervals:
      - times:
          - start_time: '02:00'
            end_time: '03:00'

Then add the following YAML to your route:

- receiver: 'sysops-pager'
  matchers:
    - alertname: OutOfMemory
  mute_time_intervals:
    - backuphours
Haldun
1

AlertManager officially supports this since this pull request was merged. The following example is taken from the pull request.

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times: 
        - start_time: "09:00"
          end_time: "17:00"

Then they can be referenced in the routing tree like so:

# The root route on which each incoming alert enters.
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 1s
  group_interval: 30s

  # A default receiver
  receiver: team-X-mail
  routes:
    - match:
        alertname: NodeIsDown
      mute_times:
        - business_hours
      receiver: team-X-mail
VanagaS
0

I would like to comment on @Michael Doubez's answer, but I do not have enough points yet.

I am writing an exporter that signals that a maintenance window is active; that metric can then be used to inhibit alerts via an inhibit rule. You can define multiple maintenance windows with a good old-fashioned cron expression. See https://github.com/jzandbergen/maintenance-exporter

0

To mute alerts for a specific period of time on a regular basis, use the mute_time_intervals feature at the Alertmanager level. For detailed info, please refer to this article:

https://techyen.com/how-to-mute-the-alerts-for-a-particular-time-in-alert-manager/