
I'm monitoring several containers using Prometheus, cAdvisor and Prometheus Alertmanager. What I want is to get an alert if a container goes down for some reason. The problem is that if a container dies, cAdvisor no longer collects any metrics for it, so any query returns 'no data' since there are no matches.

Christian Will

5 Answers


Take a look at the Prometheus function absent():

absent(v instant-vector) returns an empty vector if the vector passed to it has any elements and a 1-element vector with the value 1 if the vector passed to it has no elements.

This is useful for alerting on when no time series exist for a given metric name and label combination.

Examples:

absent(nonexistent{job="myjob"})                    => {job="myjob"}
absent(nonexistent{job="myjob",instance=~".*"})     => {job="myjob"}
absent(sum(nonexistent{job="myjob"}))               => {}

Here is an example of an alert (in the legacy pre-2.0 rule syntax):

ALERT kibana_absent
  IF absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
  FOR 5s
  LABELS {
    severity = "page"
  }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "Instance {{ $labels.instance }}, service/job {{ $labels.job }} is down for more than 5 sec."
  }
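
If you are running Prometheus 2.x, alerting rules are written in a YAML rule file instead. A rough equivalent of the rule above might look like this (the group name is just illustrative):

groups:
  - name: container-alerts        # illustrative group name
    rules:
      - alert: kibana_absent
        expr: absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
        for: 5s
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "Instance {{ $labels.instance }}, service/job {{ $labels.job }} is down for more than 5 sec."
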
Mr Peabody
    What can I do if I have many containers and I don't want to hardcode each container name for this alert? Can you help? – a1dude Feb 18 '22 at 16:02

I use a small tool called Docker Event Monitor that runs as a container on the Docker host and sends out alerts to Slack, Discord or SparkPost if certain events are triggered. You can configure which events trigger alerts.

Bruce

Try this:

 time() - container_last_seen{label="whatever-label-you-have", job="myjob"} > 60

If a container has not been seen for 60 seconds, this fires an alert. Or:

absent(container_memory_usage_bytes{label="whatever-label-you-have", job="myjob"})

Be careful with the second approach: after a container stops, it may take some time before its memory-usage series goes away, so the alert can be delayed.
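
As a sketch, the first expression could go into a Prometheus 2.x rule file roughly like this (the job label, threshold and severity are assumptions to adapt to your setup):

groups:
  - name: container-availability   # illustrative group name
    rules:
      - alert: ContainerNotSeen
        # fires when cAdvisor has not seen the container for more than 60 seconds
        expr: time() - container_last_seen{job="myjob"} > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} has not been seen for more than 60 seconds"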

Andromeda

We can use either of these two:

absent(container_start_time_seconds{name="my-container"})

This specific metric, which contains a timestamp, does not seem to linger for the 5-minute staleness period: it disappears from Prometheus results as soon as it disappears from the last scrape (see: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness), rather than after 5 minutes like container_cpu_usage_seconds_total, for instance. Tested OK, but I'm not sure I fully understand staleness...

Otherwise you can use this one:

time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 60 OR absent(container_cpu_usage_seconds_total{name="mycontainer"})

The first part gives how much time has passed since the metric was last scraped, so it catches the case where the series has disappeared from the exporter output but is still returned by PromQL (for 5 minutes by default). You have to adapt the > 60 threshold to your scrape interval, for example as sketched below.
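
For example, with a 30s scrape interval you might allow two missed scrapes plus a small margin (the numbers here are assumptions, not from the original answer):

# scrape_interval: 30s -> tolerate ~2 missed scrapes (60s) plus a margin
time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 75
  or absent(container_cpu_usage_seconds_total{name="mycontainer"})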

Adrien

cAdvisor exports the container_last_seen metric, which shows the timestamp when the container was last seen. See these docs. But cAdvisor stops exporting the container_last_seen metric a few minutes after the container stops - see this issue for details. So time() - container_last_seen > 60 may miss stopped containers. This can be fixed by wrapping container_last_seen in the last_over_time() function. For example, the following query consistently returns containers which were stopped more than 60 seconds ago but less than 1 hour ago (see the 1h lookbehind window in square brackets):

time() - last_over_time(container_last_seen{container!=""}[1h]) > 60

This query can be simplified further by using the lag() function from MetricsQL:

lag(container_last_seen{container!=""}[1h]) > 1m

The container!="" filter is needed to filter out artificial metrics for the cgroups hierarchy - see this answer for more details.
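
As a sketch, the last_over_time() query can be turned into a standard Prometheus alerting rule like this (group name, severity and annotation wording are assumptions):

groups:
  - name: cadvisor                 # illustrative group name
    rules:
      - alert: ContainerStopped
        # container not seen for more than 60s within the last hour
        expr: time() - last_over_time(container_last_seen{container!=""}[1h]) > 60
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} was last seen {{ $value | humanizeDuration }} ago"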

valyala