I must surely be missing something obvious. GCP provides me with all sorts of visible indications when a container has failed to start. For example:

[Screenshot: Failed deployment]

[Screenshot: Container Status]

But I cannot for the life of me figure out how to make it issue an alert when the container status is not OK.

How is it done?

Sam Stickland
  • Hello Sam, can you check if https://stackoverflow.com/a/54034049/12524159 answers your question? There is a good example of how to create alerts based on pod events. If not, specify which kind of events you want to create an alert for and I can help you build it. – Will R.O.F. Jul 10 '20 at 10:22
  • @willrof Thanks for the reply. I think that answer makes sense, but I can't find which log entries I should be looking for to catch missing minimum availability or CrashLoopBackOff, etc. – Sam Stickland Jul 10 '20 at 20:17

1 Answer

CrashLoopBackOff indicates that a container is repeatedly crashing after restarting. A container might crash for many reasons, and checking a Pod's logs can aid in troubleshooting the root cause.

Apart from the error text message "Does not have minimum availability", there could be other error text messages, such as "Failed to pull image". However, I recommend you identify the error text messages that are appropriate for your environment. You can check them with kubectl logs <pod_name> or in the Log Viewer.
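For instance, a quick way to surface those messages with kubectl; a minimal sketch, where the pod name and namespace are placeholders:

    # List pods and their status column (CrashLoopBackOff, ImagePullBackOff, ...)
    kubectl get pods -n default

    # Show a pod's events, including scheduling failures such as "Unschedulable"
    kubectl describe pod <pod_name> -n default

    # Logs of the previous (crashed) container instance, useful for CrashLoopBackOff
    kubectl logs <pod_name> -n default --previous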

For your reference, here are explanations for common pod issues (example log filters for some of them follow the list):

  1. CrashLoopBackOff means the container image was downloaded, but the container repeatedly fails to run.
  2. ImagePullBackOff means the container image could not be downloaded.
  3. "Does not have minimum availability" means the Deployment's Pods cannot run; this is often, but not only, caused by a lack of resources. For instance, there may be nodes available, yet the Pod is not schedulable on them per the Deployment's constraints.
  4. "Insufficient cpu" means there is not enough CPU available on the nodes.
  5. "Unschedulable" indicates that your Pod cannot be scheduled because of insufficient resources or a configuration error.

With that in mind, here is the step-by-step for creating a logs-based metric and later creating an alert based on it.

  1. Set up a logs-based metric using the following filter:

    resource.type="k8s_pod"
    severity>=WARNING
    unschedulable
    

    You can replace the filter with something more appropriate for your case.
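    If you prefer the CLI, the same metric can be created with gcloud; a minimal sketch, where the metric name pod-unschedulable is a placeholder:

        gcloud logging metrics create pod-unschedulable \
          --description="Pod logs matching unschedulable warnings" \
          --log-filter='resource.type="k8s_pod" severity>=WARNING unschedulable'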

  2. Create a label in the metric that will allow you to identify the pod that was unschedulable (or in another failed state). This will also help with grouping when you create the alert for a failing pod; a CLI sketch follows.
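    One way to define such a label from the CLI is with a config file; a sketch assuming the LogMetric resource format, where metric.yaml and the label key pod_name are placeholders:

        # metric.yaml -- logs-based metric with a label extracted from the log entry
        name: pod-unschedulable
        description: Pod logs matching unschedulable warnings
        filter: resource.type="k8s_pod" severity>=WARNING unschedulable
        labelExtractors:
          pod_name: EXTRACT(resource.labels.pod_name)
        metricDescriptor:
          metricKind: DELTA
          valueType: INT64
          labels:
          - key: pod_name
            valueType: STRING

    Then run gcloud logging metrics create pod-unschedulable --config-from-file=metric.yaml (flag availability may depend on your gcloud version).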

  3. In Stackdriver Monitoring, create an alert with the following parameters.

    • Set the resource type to k8s_pod
    • Set the metric to the one you created in step 1
    • Set Group By to pod_name (the label created in step 2)
    • In the advanced aggregation section, set the Aligner to sum and the Alignment Period to 5m (or whatever you think is more appropriate).
    • Set the condition's For field to more than 1 minute to prevent the alert from firing over and over. This can also be adjusted per your requirements. An equivalent policy file is sketched below.
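    The same alert can be expressed as a policy file; a rough sketch, assuming the logs-based metric from step 1 is named pod-unschedulable (double-check the fields against the Monitoring API's AlertPolicy reference):

      {
        "displayName": "Pod unschedulable",
        "combiner": "OR",
        "conditions": [{
          "displayName": "Unschedulable pod log entries",
          "conditionThreshold": {
            "filter": "metric.type=\"logging.googleapis.com/user/pod-unschedulable\" AND resource.type=\"k8s_pod\"",
            "aggregations": [{
              "alignmentPeriod": "300s",
              "perSeriesAligner": "ALIGN_SUM",
              "groupByFields": ["metric.label.pod_name"]
            }],
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0,
            "duration": "60s"
          }
        }]
      }

    Save it as policy.json and create it with gcloud alpha monitoring policies create --policy-from-file=policy.json, then attach your notification channels.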

I hope this information is helpful. If you have any questions, let me know in the comments.

Will R.O.F.