
I'm monitoring several containers using Prometheus, cAdvisor and Prometheus Alertmanager. What I want is to get an alert if a container goes down for some reason. The problem is that if a container dies, cAdvisor no longer collects any metrics for it, so any query returns 'no data' since there are no matches.

Christian Will

5 Answers


Take a look at the Prometheus function absent():

absent(v instant-vector) returns an empty vector if the vector passed to it has any elements and a 1-element vector with the value 1 if the vector passed to it has no elements.

This is useful for alerting on when no time series exist for a given metric name and label combination.

Examples:

absent(nonexistent{job="myjob"})                    => {job="myjob"}
absent(nonexistent{job="myjob",instance=~".*"})     => {job="myjob"}
absent(sum(nonexistent{job="myjob"}))               => {}

Here is an example of an alert (in the legacy pre-2.0 rule syntax):

ALERT kibana_absent
  IF absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
  FOR 5s
  LABELS {
    severity = "page"
  }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "Instance {{ $labels.instance }}, service/job {{ $labels.job }} is down for more than 5 sec."
  }
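
If you are running Prometheus 2.x, alerting rules are written in a YAML rule file instead. A rough equivalent of the rule above might look like this (the group name is just illustrative):

groups:
  - name: container-alerts        # illustrative group name
    rules:
      - alert: kibana_absent
        expr: absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
        for: 5s
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "Instance {{ $labels.instance }}, service/job {{ $labels.job }} is down for more than 5 sec."
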
Mr Peabody
    What can I do if I have many containers and I don't want to hardcode each container name for this alert? Can you help? – a1dude Feb 18 '22 at 16:02

I use a small tool called Docker Event Monitor that runs as a container on the Docker host and sends out alerts to Slack, Discord or SparkPost if certain events are triggered. You can configure which events trigger alerts.

Bruce

Try this:

 time() - container_last_seen{label="whatever-label-you-have", job="myjob"} > 60

If a container has not been seen for 60 seconds, this fires an alert. Or:

absent(container_memory_usage_bytes{label="whatever-label-you-have", job="myjob"})

Be careful with the second approach: after a container stops, it may take some time before its memory-usage series goes away, so the alert can be delayed.
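
As a sketch, the first expression could go into a Prometheus 2.x rule file roughly like this (the job label, threshold and severity are assumptions to adapt to your setup):

groups:
  - name: container-availability   # illustrative group name
    rules:
      - alert: ContainerNotSeen
        # fires when cAdvisor has not seen the container for more than 60 seconds
        expr: time() - container_last_seen{job="myjob"} > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} has not been seen for more than 60 seconds"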

Andromeda

We can use either of these two:

absent(container_start_time_seconds{name="my-container"})

This specific metric, which contains a timestamp, does not seem to linger for the 5-minute staleness period: it disappears from Prometheus results as soon as it disappears from the last scrape (see: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness), rather than after 5 minutes like container_cpu_usage_seconds_total, for instance. Tested OK, but I'm not sure I fully understand staleness...

Otherwise you can use this one:

time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 60 OR absent(container_cpu_usage_seconds_total{name="mycontainer"})

The first part gives how much time has passed since the metric was last scraped, so it catches the case where the series has disappeared from the exporter output but is still returned by PromQL (for 5 minutes by default). You have to adapt the > 60 threshold to your scrape interval, for example as sketched below.
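
For example, with a 30s scrape interval you might allow two missed scrapes plus a small margin (the numbers here are assumptions, not from the original answer):

# scrape_interval: 30s -> tolerate ~2 missed scrapes (60s) plus a margin
time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 75
  or absent(container_cpu_usage_seconds_total{name="mycontainer"})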

Adrien

cAdvisor exports the container_last_seen metric, which shows the timestamp when the container was last seen. See these docs. But cAdvisor stops exporting the container_last_seen metric a few minutes after the container stops - see this issue for details. So time() - container_last_seen > 60 may miss stopped containers. This can be fixed by wrapping container_last_seen in the last_over_time() function. For example, the following query consistently returns containers which were stopped more than 60 seconds ago but less than 1 hour ago (see the 1h lookbehind window in square brackets):

time() - last_over_time(container_last_seen{container!=""}[1h]) > 60

This query can be simplified further by using the lag() function from MetricsQL:

lag(container_last_seen{container!=""}[1h]) > 1m

The container!="" filter is needed to filter out artificial metrics for the cgroups hierarchy - see this answer for more details.
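
As a sketch, the last_over_time() query can be turned into a standard Prometheus alerting rule like this (group name, severity and annotation wording are assumptions):

groups:
  - name: cadvisor                 # illustrative group name
    rules:
      - alert: ContainerStopped
        # container not seen for more than 60s within the last hour
        expr: time() - last_over_time(container_last_seen{container!=""}[1h]) > 60
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} was last seen {{ $value | humanizeDuration }} ago"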

valyala