
I want to set up alerts for when any pod in my Kubernetes cluster is in a CrashLoopBackOff state. I'm running Kubelet on Azure Kubernetes Service (AKS) and have set up a Prometheus Operator which exposes metrics/cadvisor.

Other similar questions on this topic, such as this and this, are not relevant to Kubelet setups. The recommended queries, such as kube_pod_container_status_waiting_reason{} or kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}, are not available to me with Kubelet on AKS.

Kubelet exposes a somewhat limited set of metrics. Here is what I have tried:

  1. Container state:
container_tasks_state{container='my_container', kubernetes_azure_com_cluster='my_cluster'}

This seems like it should be the right solution, but the state is always 0, whether the container is Running or in CrashLoopBackOff. This seems to be a known bug.

  2. Time from start:
time() - container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'}

This lets me alert whenever a container's uptime is low; any pod that triggers repeated alerts is crashing. It is inelegant, though: healthy containers also trigger the alert until they have lived long enough, and my alert channel becomes very noisy.

  3. Detect exited containers:
kubelet_running_containers{kubernetes_azure_com_cluster='my_cluster', container_state='exited'}

This can detect a crashing container, but containers may also exit gracefully, so a notification on every container exit is not very useful. We essentially get a 'container exited' alert and then have to check manually whether it was a crash or a graceful exit.

  4. Number of running pods:
kubelet_running_pods{kubernetes_azure_com_cluster='my_cluster'}

Does not change on a container crash.

  5. Scrape error:
container_scrape_error{kubernetes_azure_com_cluster='my_cluster'}

Again, does not change on a container crash.
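
To make the "time from start" attempt concrete, here is a minimal Python sketch of the check I am effectively doing by hand when I look for repeat alerts. It is only an illustration of the logic, not something Kubelet or Prometheus provides: the function name, the 15-minute window, and the threshold of three starts are all my own assumptions, and the input is assumed to be the sampled values of container_start_time_seconds for one container.

```python
from typing import Iterable


def is_crash_looping(
    start_times: Iterable[float],
    now: float,
    window_s: float = 900.0,  # assumed look-back window (15 minutes)
    min_starts: int = 3,      # assumed threshold, chosen for illustration
) -> bool:
    """Return True if the container started at least `min_starts` times
    within the last `window_s` seconds.

    `start_times` holds sampled start timestamps (Unix seconds); repeated
    samples of the same start are collapsed into one distinct start.
    """
    recent_starts = {t for t in start_times if 0 <= now - t <= window_s}
    return len(recent_starts) >= min_starts
```

A container that has started once and kept running yields a single distinct start time and is not flagged, which is what would remove the noise from healthy containers. What I am missing is a query that gives me this behaviour, or the actual CrashLoopBackOff reason, directly.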

Which query will allow me to discover whether a pod has entered the CrashLoopBackOff state?

Student