I have my Kubernetes cluster set up on AWS, where I am trying to monitor several pods using cAdvisor + Prometheus + Alertmanager. What I want to do is launch an email alert (with the service/container name) if a container/pod goes down or gets stuck in the Error or CrashLoopBackOff state, or in any state other than Running.
2 Answers
Prometheus collects a wide range of metrics. As an example, you can use the metric kube_pod_container_status_restarts_total for monitoring restarts, which will reflect your problem.
It carries labels which you can use in the alert:
- container=<container-name>
- namespace=<pod-namespace>
- pod=<pod-name>
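For a quick sanity check, you can run the metric in the Prometheus expression browser first; a couple of example queries (the namespace value here is only an illustration):

# restarts per container in one namespace ("kube-system" is just an example value)
kube_pod_container_status_restarts_total{namespace="kube-system"}

# total restarts per pod across the cluster
sum by (pod) (kube_pod_container_status_restarts_total)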
So, all you need is to configure your alertmanager.yml with the correct SMTP settings and a receiver, and to add an alerting rule like the one below (the groups: section is an alerting-rule file loaded by Prometheus, not part of alertmanager.yml):
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'team-X+alerts@example.org'

# Only one default receiver
route:
  receiver: team-X-mails

# Example group with one alert
groups:
- name: example-alert
  rules:
  # Alert about restarts
  - alert: RestartAlerts
    expr: sum(kube_pod_container_status_restarts_total) by (pod) > 5
    for: 10m
    annotations:
      summary: "More than 5 restarts in pod {{ $labels.pod }}"
      description: "{{ $labels.container }} restarted {{ $value }} times in pod {{ $labels.namespace }}/{{ $labels.pod }}"
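Note that Prometheus itself also has to load the rule group and know where Alertmanager lives. A minimal sketch of the relevant prometheus.yml parts (the rule file name and the Alertmanager address are assumptions):

# prometheus.yml (sketch)
rule_files:
  - 'alert.rules.yml'    # the file containing the groups: section above (name assumed)

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']   # address of your Alertmanager service (assumed)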

Anton Kostenko
- You mean if a pod is stuck in CrashLoopBackOff it will restart multiple times to recover from this state? Also, how do I monitor if the Prometheus pods (server, alertmanager) themselves get stuck in this state? – shiv455 Mar 26 '18 at 12:57
- 1. Yes, here is an example of the state in that situation - `nfs-web-fdr9h 0/1 CrashLoopBackOff 8 16m`. So, 8 here is the count of restarts. 2. Yes, it monitors them, because they are pods too. Of course, Prometheus and Alertmanager can only watch themselves and send alerts while they are working. If they are down - who will send an alert? :) – Anton Kostenko Mar 26 '18 at 15:01
- When I run kube_pod_container_status_restarts_total in the Prometheus database it gives me "no data", although I have killed the kube-dns pod and a new one was recreated. – shiv455 Mar 26 '18 at 21:25
- Check that you have [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) installed in your cluster. Btw, here is another [set of rules](https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/assets/prometheus/rules/kube-state-metrics.rules.yaml) which can be helpful. – Anton Kostenko Mar 27 '18 at 07:20
- I do have kube-state-metrics running. – shiv455 Mar 27 '18 at 16:46
- Shouldn't `pod-name` rather just be `pod`? Same for `container-name` etc.? – Dan Dec 20 '19 at 15:19
- Doesn't "count" take into account all the occurrences "EVER" of the pod being restarted? Wouldn't `sum by (pod) (increase(kube_pod_container_status_restarts_total[5m])) > 2` be better, because it might have restarted, but if it recovered - all is fine. – Daniel Hajduk Feb 03 '20 at 12:19
I'm using this one:
- alert: PodCrashLooping
  annotations:
    description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
    summary: Pod is crash looping.
  expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[5m]) * 60 * 5 > 0
  for: 5m
  labels:
    severity: critical

dansl1982
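For the "stuck in any state other than Running" part of the original question, the restart-based rules above can be complemented with rules on other kube-state-metrics series. A rough sketch (metric availability depends on your kube-state-metrics version; the thresholds and durations are assumptions):

# pod phase has been Pending/Unknown/Failed for too long (threshold and duration assumed)
- alert: PodNotRunning
  expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  for: 15m
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-Running phase for 15 minutes."

# container is waiting in CrashLoopBackOff (covers pods whose phase is still Running)
- alert: ContainerCrashLooping
  expr: sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 0
  for: 10m
  annotations:
    summary: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff."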