I have my Kubernetes cluster set up on AWS, where I am trying to monitor several pods using cAdvisor + Prometheus + Alertmanager. What I want to do is launch an email alert (with the service/container name) if a container/pod goes down or gets stuck in the Error or CrashLoopBackOff state, or in any state other than Running.
2 Answers
Prometheus collects a wide range of metrics. As an example, you can use the metric kube_pod_container_status_restarts_total for monitoring restarts, which will reflect your problem.
It carries labels which you can use in the alert:
- container=<container-name>
- namespace=<pod-namespace>
- pod=<pod-name>
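For a quick sanity check, you can run the metric in the Prometheus expression browser first; a couple of example queries (the namespace value here is only an illustration):

# restarts per container in one namespace ("kube-system" is just an example value)
kube_pod_container_status_restarts_total{namespace="kube-system"}

# total restarts per pod across the cluster
sum by (pod) (kube_pod_container_status_restarts_total)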
So, all you need is to configure your alertmanager.yml with the correct SMTP settings and a receiver, and to add an alerting rule like the one below (the groups: section is an alerting-rule file loaded by Prometheus, not part of alertmanager.yml):
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'team-X+alerts@example.org'

# Only one default receiver
route:
  receiver: team-X-mails

# Example group with one alert
groups:
- name: example-alert
  rules:
  # Alert about restarts
  - alert: RestartAlerts
    expr: sum(kube_pod_container_status_restarts_total) by (pod) > 5
    for: 10m
    annotations:
      summary: "More than 5 restarts in pod {{ $labels.pod }}"
      description: "{{ $labels.container }} restarted {{ $value }} times in pod {{ $labels.namespace }}/{{ $labels.pod }}"
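Note that Prometheus itself also has to load the rule group and know where Alertmanager lives. A minimal sketch of the relevant prometheus.yml parts (the rule file name and the Alertmanager address are assumptions):

# prometheus.yml (sketch)
rule_files:
  - 'alert.rules.yml'    # the file containing the groups: section above (name assumed)

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']   # address of your Alertmanager service (assumed)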

Anton Kostenko
- You mean if a pod is stuck in CrashLoopBackOff it will restart multiple times to recover from this state? Also, how do I monitor if the Prometheus pods (server, alertmanager) themselves get stuck in this state? – shiv455 Mar 26 '18 at 12:57
- 1. Yes, here is an example of the state in that situation - `nfs-web-fdr9h 0/1 CrashLoopBackOff 8 16m`. So, 8 here is the count of restarts. 2. Yes, it monitors them, because they are pods too. Of course, Prometheus and Alertmanager can only watch themselves and send alerts while they are working. If they are down - who will send an alert? :) – Anton Kostenko Mar 26 '18 at 15:01
- When I run kube_pod_container_status_restarts_total in the Prometheus database it gives me "no data", although I have killed the kube-dns pod and a new one was recreated. – shiv455 Mar 26 '18 at 21:25
- Check that you have [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) installed in your cluster. Btw, here is another [set of rules](https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/assets/prometheus/rules/kube-state-metrics.rules.yaml) which can be helpful. – Anton Kostenko Mar 27 '18 at 07:20
- I do have kube-state-metrics running. – shiv455 Mar 27 '18 at 16:46
- Shouldn't `pod-name` rather just be `pod`? Same for `container-name` etc.? – Dan Dec 20 '19 at 15:19
- Doesn't "count" take into account all the occurrences "EVER" of the pod being restarted? Wouldn't `sum by (pod) (increase(kube_pod_container_status_restarts_total[5m])) > 2` be better, because it might have restarted, but if it recovered - all is fine. – Daniel Hajduk Feb 03 '20 at 12:19
I'm using this one:
- alert: PodCrashLooping
  annotations:
    description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
    summary: Pod is crash looping.
  expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[5m]) * 60 * 5 > 0
  for: 5m
  labels:
    severity: critical

dansl1982
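For the "stuck in any state other than Running" part of the original question, the restart-based rules above can be complemented with rules on other kube-state-metrics series. A rough sketch (metric availability depends on your kube-state-metrics version; the thresholds and durations are assumptions):

# pod phase has been Pending/Unknown/Failed for too long (threshold and duration assumed)
- alert: PodNotRunning
  expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  for: 15m
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-Running phase for 15 minutes."

# container is waiting in CrashLoopBackOff (covers pods whose phase is still Running)
- alert: ContainerCrashLooping
  expr: sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) > 0
  for: 10m
  annotations:
    summary: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff."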