Kubelet + Prometheus: how to query if a pod is crashing?

Question

I want to set up alerts for when any pod in my Kubernetes cluster is in the CrashLoopBackOff state. I'm running Kubelet on Azure Kubernetes Service (AKS) and have set up a Prometheus Operator, which exposes metrics/cadvisor.

Other similar questions on this topic, such as this and this, are not relevant to Kubelet setups. The recommended queries, such as kube_pod_container_status_waiting_reason{} and kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}, are not available to me with Kubelet on AKS.

Kubelet has somewhat limited metrics. Here is what I have tried:

Container state:

container_tasks_state{container='my_container', kubernetes_azure_com_cluster='my_cluster'}

This seems like it should be the right solution, but the state is always 0, whether the container is Running or in CrashLoopBackOff. This seems to be a known bug.

Time from start:

time() - container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'}

With this we can notify when a container's uptime is low; any pod that triggers repeated alerts is crashing. This is inelegant, as healthy containers will also notify until they have lived long enough, and my alert channel becomes very noisy.
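
As a rough sketch, the alert condition for this approach could look like the following. The 300-second threshold is an assumed, illustrative value, not one tuned for any particular cluster:

```promql
# Matches containers that started less than 300 seconds ago.
# The threshold (300) is a hypothetical value chosen for illustration.
time() - container_start_time_seconds{kubernetes_azure_com_cluster='my_cluster', container='my_container'} < 300
```

Any container younger than the threshold matches this expression, which is why healthy containers that have only just started trigger it as well.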

Detect exited containers:

kubelet_running_containers{kubernetes_azure_com_cluster='my_cluster', container_state='exited'}

This can detect a crashing container, but containers may also exit gracefully, so a notification on container exits is not very useful. We essentially get a 'container exited' alert and then need to check manually whether it was a crash or a graceful exit.

Number of running pods:

kubelet_running_pods{kubernetes_azure_com_cluster='my_cluster'}

Does not change on a container crash.

Scrape error:

container_scrape_error{kubernetes_azure_com_cluster='my_cluster'}

Again, does not change on a container crash.

Which query will allow me to discover whether a pod has entered the CrashLoopBackOff state?
