I want to set up alerts when any pod in my Kubernetes cluster is in the CrashLoopBackOff state. I'm running Kubelet on Azure Kubernetes Service (AKS) and have set up a Prometheus Operator which exposes metrics/cadvisor.
Other similar questions on this topic, such as this and this, are not relevant to Kubelet setups. The recommended kube_pod_container_status_waiting_reason{} / kube_pod_status_phase{phase="Pending|Unknown|Failed"} and similar queries are not available to me with Kubelet on AKS.
Kubelet has somewhat limited metrics. Here is what I have tried:
- Container state:
container_tasks_state{container='my_container', kubernetes_azure_com_cluster='my_cluster'}
This seems like it should be the right solution, but the state is always 0, whether the container is Running or in CrashLoopBackOff. This seems to be a known bug.
- Time from start:
time() - container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'}
Here we can notify when the container's uptime is low; any pod that alerts repeatedly is crashing. This is inelegant because healthy containers also trigger the alert until they have lived long enough, so my alert channel becomes very noisy.
- Detect exited containers:
kubelet_running_containers{kubernetes_azure_com_cluster='my_cluster', container_state='exited'}
This can detect a crashing container, but containers may also exit gracefully, so a notification on every container exit is not very useful. We essentially get a 'container exited' alert and then need to check manually whether it was a crash or a graceful exit.
- Number of running pods:
kubelet_running_pods{kubernetes_azure_com_cluster='my_cluster'}
Does not change on a container crash.
- Scrape error:
container_scrape_error{kubernetes_azure_com_cluster='my_cluster'}
Again, does not change on a container crash.
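For reference, the "time from start" attempt above, written as a threshold expression, looks roughly like the sketch below. The 300-second cutoff is an arbitrary value chosen for illustration, not something taken from my actual setup, and the label values are the same placeholders used in the queries above.

```promql
# Returns a series for every matching container that has been up for less than
# 300 seconds (assumed threshold). Healthy containers that have only just
# started also match, which is the source of the noise described above.
time() - container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'} < 300
```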
Which query will allow me to discover whether a pod has entered the CrashLoopBackOff state?