
Is there a way to monitor the pod status and restart count of pods running in a GKE cluster with Stackdriver?

While I can see CPU, memory and disk usage metrics for all pods in Stackdriver, there seems to be no way to get metrics about crashing pods or pods in a replica set being restarted due to crashes.

I'm using a Kubernetes replica set to manage the pods, hence they are respawned and created with a new name when they crash. As far as I can tell, the metrics in Stackdriver are reported per pod name (which is unique only for the lifetime of the pod), which doesn't seem very sensible.

Alerting on pod failures seems like such a natural thing that it is hard to believe it isn't supported at the moment. As they stand, the monitoring and alerting capabilities I get from Stackdriver for Google Container Engine seem rather useless, since they are all bound to pods whose lifetime can be very short.

So if this doesn't work out of the box are there known workarounds or best practices on how to monitor for continuously crashing pods?

– ctavan
  • I am working on a similar solution as well. At the moment I haven't found much regarding what you ask, or other similar metrics that could be interesting. In case I have any updates I'll let you know! – Michele Orsi Jun 04 '17 at 11:03
  • Agreed that this is a glaring hole in the GKE / Stackdriver stack. I'm pretty amazed that I can't find a way to set up alerts for when a pod restarts or gets evicted, or when a deployment is added, etc. I will probably end up writing my own Python-based daemon to do this (using this: https://github.com/kubernetes-client/python ). – JJC Nov 16 '18 at 14:27
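Along the lines of the daemon mentioned in the comment above, a minimal sketch using the official Kubernetes Python client could look like this (the print() is a placeholder for whatever notification channel you actually use):

    # Rough sketch (not the commenter's actual daemon): watch all pods and
    # report whenever a container's restart count increases.
    from kubernetes import client, config, watch

    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    v1 = client.CoreV1Api()
    last_seen = {}

    for event in watch.Watch().stream(v1.list_pod_for_all_namespaces):
        pod = event["object"]
        for cs in pod.status.container_statuses or []:
            key = (pod.metadata.namespace, pod.metadata.name, cs.name)
            if cs.restart_count > last_seen.get(key, 0):
                # Replace with a Stackdriver/Slack/etc. notification of your choice
                print(f"{key} restarted, restart count is now {cs.restart_count}")
            last_seen[key] = cs.restart_count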

5 Answers

7

There is a built-in metric now, so it's easy to build a dashboard and/or alert on it without setting up custom metrics:

Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
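
For reference, a Cloud Monitoring filter along these lines should surface the metric in dashboards or alerting policies (a sketch; add resource labels such as cluster_name or namespace_name to narrow it down). Since the metric is a cumulative counter, alert on its delta or rate rather than the raw value:

    resource.type = "k8s_container"
    metric.type = "kubernetes.io/container/restart_count"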
– dan carter
  • This should be the way to do it now! – Dennis Gloss Dec 25 '20 at 22:04
  • Something changed since this comment was published. Now the alert often triggers for pods that are being terminated. Add a filter by `state=ACTIVE` to avoid this and only be alerted for container restarts in pods that are active. – Boyko Karadzhov Jun 22 '21 at 12:54
6

You can achieve this manually with the following:

  1. In the Logs Viewer, create the following filter:

    resource.labels.project_id="<PROJECT_ID>"
    resource.labels.cluster_name="<CLUSTER_NAME>"
    resource.labels.namespace_name="<NAMESPACE, or default>"
    jsonPayload.message:"failed liveness probe"
    
  2. Create a metric by clicking the Create Metric button above the filter input and filling in the details.

  3. You may now track this metric in Stackdriver.
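
If you prefer to script this rather than click through the UI, a rough gcloud equivalent would be the following (the metric name is just a placeholder; adjust the filter to your cluster):

    gcloud logging metrics create liveness-probe-failures \
      --description="Containers failing their liveness probe" \
      --log-filter='resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.namespace_name="<NAMESPACE>" AND jsonPayload.message:"failed liveness probe"'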

I would be happy to learn of a built-in metric that could be used instead of this.

– Jonathan Lin
  • For the payload you probably want ("Killing container" AND "Container failed liveness probe"), otherwise you are going to match the autoscaler terminating pods when load reduces. – dan carter Feb 10 '19 at 19:37
  • Do you know how to automatically resolve an alert based on this method? – 6utt3rfly Nov 19 '19 at 23:57
  • Now it seems to be "Container product failed liveness probe, will be restarted" – unludo Jan 15 '20 at 16:25
  • You should filter on resource too otherwise your metric is going to be scanning every single log message on your cluster namespace `resource.type="k8s_pod"` – dan carter Dec 01 '20 at 21:46
  • I also find it useful to add a metric label on the container name as grouping by transient pod name is not so useful. Field: jsonPayload.message RegEx: Container ([^\s\\]*) – dan carter Dec 01 '20 at 22:08
5

In my cluster (a bare-metal k8s cluster), I use kube-state-metrics (https://github.com/kubernetes/kube-state-metrics) to do what you want. This project belongs to the kubernetes repo and it is quite easy to use. Once deployed, you can use the kube_pod_container_status_restarts metric (kube_pod_container_status_restarts_total in newer releases) to know whether a container has restarted.
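
If you scrape kube-state-metrics with Prometheus, an alerting rule on that metric could look roughly like this (the threshold and time window are arbitrary examples):

    groups:
      - name: pod-restarts
        rules:
          - alert: ContainerRestartingFrequently
            expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"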

– WizardCXY
  • I just installed kube-state-metrics on my dev cluster and this stat is missing. No other useful stats re Pod state seem available, actually. The words "restart", "terminate", "evict", "image", and "backoff" are nowhere to be seen in the returned 12k metrics. :facepalm: – JJC Nov 16 '18 at 14:31
  • Weird, I can see the restart metric in the repo. https://github.com/kubernetes/kube-state-metrics/blob/17ddeca348130ead2f893295ea093429c388887c/internal/collector/pod.go#L488 – nhooyr Mar 06 '19 at 02:02
0

Others have commented on how to do this with metrics, which is the right solution if you have a very large number of crashing pods.

An alternative approach is to treat crashing pods as discrete events or even log lines. You can do this with Robusta (disclaimer, I wrote this) with YAML like this:

triggers:
  - on_pod_update: {}
actions:
  - restart_loop_reporter:
      restart_reason: CrashLoopBackOff
  - image_pull_backoff_reporter:
      rate_limit: 3600
sinks:
  - slack

Here we're triggering an action named restart_loop_reporter whenever a pod updates. The data stream comes from the APIServer.

The restart_loop_reporter is an action which filters out non-crashing pods. Above it's configured to report only on CrashLoopBackOffs but you could remove that to report all crashes.

A benefit of doing it this way is that you can gather extra data about the crash automatically. For example, the above will fetch the pod's logs and forward them along with the crash report.

I'm sending the result here to Slack, but you could just as well send it to a structured output like Kafka (already builtin) or Stackdriver (not yet supported, but I can fix that if you like).

– Natan Yellin
-1

Remember that you can always raise a feature request if the available options are not enough.

– grimmjow_sms