I'm monitoring several containers using Prometheus, cAdvisor and Prometheus Alertmanager. What I want is to get an alert if a container goes down for some reason. The problem is that if a container dies, no metrics are collected by cAdvisor, so any query returns 'no data' since there are no matches for it.
5 Answers
Take a look at the Prometheus function absent():
absent(v instant-vector) returns an empty vector if the vector passed to it has any elements and a 1-element vector with the value 1 if the vector passed to it has no elements.
This is useful for alerting on when no time series exist for a given metric name and label combination.
Examples:
absent(nonexistent{job="myjob"}) => {job="myjob"}
absent(nonexistent{job="myjob",instance=~".*"}) => {job="myjob"}
absent(sum(nonexistent{job="myjob"})) => {}
Here is an example of an alert:
ALERT kibana_absent
  IF absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
  FOR 5s
  LABELS {
    severity = "page"
  }
  ANNOTATIONS {
    summary = "Instance {{$labels.instance}} down",
    description = "Instance {{$labels.instance}}, Service/Job {{$labels.job}} is down for more than 5 sec."
  }
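For reference, the rule above uses the legacy Prometheus 1.x syntax. On Prometheus 2.x the same alert would go into a YAML rule file, roughly like this (the group name is just a placeholder):

groups:
  - name: container-alerts
    rules:
      - alert: kibana_absent
        expr: absent(container_cpu_usage_seconds_total{com_docker_compose_service="kibana"})
        for: 5s
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "Instance {{ $labels.instance }}, Service/Job {{ $labels.job }} is down for more than 5 sec."

Note that absent() only returns the labels you put into its selector, so {{ $labels.instance }} and {{ $labels.job }} will render empty here unless those labels are part of the matcher.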

What can I do if I have many containers and I don't want to hardcode each container's name for this alert? Can you help? – a1dude Feb 18 '22 at 16:02
I use a small tool called Docker Event Monitor that runs as a container on the Docker host and sends out alerts to Slack, Discord or SparkPost if certain events are triggered. You can configure which events trigger alerts.
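As a rough sketch only (the image name and the environment variable below are placeholders, not the tool's actual configuration keys - check its README for the real ones), such a tool is typically run with the Docker socket mounted so it can listen to Docker events:

version: "3"
services:
  event-monitor:
    # placeholder image name - replace with the actual Docker Event Monitor image
    image: docker-event-monitor:latest
    restart: unless-stopped
    volumes:
      # read-only access to the Docker socket so the tool can watch container events
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      # hypothetical variable name - the tool's real config keys may differ
      - SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX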

Try this:
time() - container_last_seen{label="whatever-label-you-have", job="myjob"} > 60
If a container has not been seen for 60 seconds, this fires an alert. Or:
absent(container_memory_usage_bytes{label="whatever-label-you-have", job="myjob"})
Be careful with the second approach: after a container stops, it can take some time before its memory-usage series actually disappears, so the alert may be delayed.
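As a sketch, the first expression can be wrapped into an alerting rule like this (group name, thresholds and the job label are assumptions - adjust them to your setup):

groups:
  - name: container-alerts
    rules:
      - alert: ContainerNotSeen
        expr: time() - container_last_seen{job="myjob"} > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} has not been seen for more than 60 seconds"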

We can use either of these two:
absent(container_start_time_seconds{name="my-container"})
This specific metric contains a timestamp and does not seem to be kept around for the usual 5-minute staleness period: it disappears from Prometheus results as soon as it is missing from the last scrape (see: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness), rather than 5 minutes later like container_cpu_usage_seconds_total, for instance. Tested OK, but I'm not sure I fully understand staleness...
Otherwise, you can use this one:
time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 60 OR absent(container_cpu_usage_seconds_total{name="mycontainer"})
The first part gives the time elapsed since the metric was last scraped, so it works while the metric has disappeared from the exporter output but is still returned by PromQL (for 5 minutes by default). You have to adapt the > 60 to your scrape interval.
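A hedged sketch of the combined expression as an alerting rule (group name, for: duration and severity are assumptions):

groups:
  - name: container-alerts
    rules:
      - alert: ContainerDown
        expr: >
          time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 60
          or absent(container_cpu_usage_seconds_total{name="mycontainer"})
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Container mycontainer has not been scraped for more than 60 seconds or its series is absent"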

cAdvisor exports the container_last_seen metric, which shows the timestamp when the container was last seen. See these docs. But cAdvisor stops exporting the container_last_seen metric a few minutes after the container stops - see this issue for details. So time() - container_last_seen > 60 may miss stopped containers. This can be fixed by wrapping container_last_seen in the last_over_time() function. For example, the following query consistently returns containers which were stopped more than 60 seconds ago but less than 1 hour ago (see the 1h lookbehind window in square brackets):
time() - last_over_time(container_last_seen{container!=""}[1h]) > 60
This query can be simplified further when using the lag function from MetricsQL:
lag(container_last_seen{container!=""}[1h]) > 1m
The container!="" filter is needed to filter out artificial metrics for the cgroups hierarchy - see this answer for more details.
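As a sketch, the last_over_time() variant fits into a rule file like this (group name, for: duration and severity are assumptions):

groups:
  - name: container-alerts
    rules:
      - alert: ContainerStopped
        expr: time() - last_over_time(container_last_seen{container!=""}[1h]) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} on {{ $labels.instance }} stopped more than 60 seconds ago"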
