
I have an application deployed to Kubernetes that depends on an outside application. Sometimes the connection between the two gets into an invalid state, and that can only be fixed by restarting my application.

To do automatic restarts, I have configured a liveness probe that will verify the connection.
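
For illustration, it's roughly along these lines (the endpoint path, port, and timings here are just placeholders, not my real config):

livenessProbe:
  httpGet:
    # hypothetical endpoint that verifies the connection to the outside application
    path: /healthz/connection
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3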

This has been working great. However, I'm afraid that if the outside application goes down (so that the connection error isn't just due to an invalid pod state), all of my pods will immediately restart and my application will become completely unavailable. I want it to remain running so that functionality that doesn't depend on the bad service can continue.

I'm wondering if a pod disruption budget would prevent this scenario, as it limits the number of pods that are down due to a "voluntary" disruption. However, the K8s docs don't state whether liveness probe failures count as a voluntary disruption. Do they?

roim
  • Is it feasible to migrate the outside application to your Kubernetes cluster? This won't directly answer your question, but I reckon you could take a look at the following articles: [1](https://blog.risingstack.com/designing-microservices-architecture-for-failure/), [2](https://loft.sh/blog/kubernetes-readiness-probes-examples-common-pitfalls/#external-dependencies), [3](https://cloud.google.com/architecture/scalable-and-resilient-apps#resilience_designing_to_withstand_failures) – Dawid Kruk Apr 27 '21 at 13:36
  • @DawidKruk the outside application unfortunately is Azure CosmosDB, which doesn't have the best client drivers for the environment I'm in. However, I've had similar issues using ioredis to connect to my self-hosted Redis cluster, so I'll take a look. Thanks! – roim Apr 27 '21 at 20:33

3 Answers


I would say, according to the documentation:

Voluntary and involuntary disruptions

Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.

We call these unavoidable cases involuntary disruptions to an application. Examples are:

  • a hardware failure of the physical machine backing the node
  • cluster administrator deletes VM (instance) by mistake
  • cloud provider or hypervisor failure makes VM disappear
  • a kernel panic
  • the node disappears from the cluster due to cluster network partition
  • eviction of a pod due to the node being out-of-resources.

Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.

We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:

  • deleting the deployment or other controller that manages the pod
  • updating a deployment's pod template causing a restart
  • directly deleting a pod (e.g. by accident)

Cluster administrator actions include:

  • Draining a node for repair or upgrade.
  • Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling ).
  • Removing a pod from a node to permit something else to fit on that node.

-- Kubernetes.io: Docs: Concepts: Workloads: Pods: Disruptions

So your example is quite different, and to my knowledge it's neither a voluntary nor an involuntary disruption.


Also, taking a look at another part of the Kubernetes documentation:

Pod lifetime

Like individual application containers, Pods are considered to be relatively ephemeral (rather than durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to nodes where they remain until termination (according to restart policy) or deletion. If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period.

Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the work of managing the relatively disposable Pod instances.

-- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: Pod lifetime

Container probes

The kubelet can optionally perform and react to three kinds of probes on running containers (focusing on a livenessProbe):

  • livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a Container does not provide a liveness probe, the default state is Success.

-- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: Container probes

When should you use a liveness probe?

If the process in your container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod's restartPolicy.

If you'd like your container to be killed and restarted if a probe fails, then specify a liveness probe, and specify a restartPolicy of Always or OnFailure.

-- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: When should you use a liveness probe

Based on this information, it would be better to create a custom liveness probe that distinguishes between internal process health checks and the external dependency (liveness) health check. In the first case your container should stop/terminate the process, unlike the second case with the external dependency. See the sketch below.
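
A rough sketch of that separation, assuming the application exposes two hypothetical endpoints (/healthz/live covering internal process health only, /healthz/ready additionally covering the external dependency); the paths, port, and thresholds are only illustrative:

livenessProbe:
  httpGet:
    # internal process health only; a failure here restarts the container
    path: /healthz/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    # also checks the external dependency; a failure here only removes the pod
    # from Service endpoints instead of restarting it
    path: /healthz/ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

With this split, an outage of the external dependency marks the pods as not ready instead of restarting all of them at once.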

Answering the following question:

I'm wondering if a pod disruption budget would prevent this scenario.

In this particular scenario PDB will not help.


I reckon giving more visibility to the comment I've made under the question, with the additional resources on the matter, could prove useful to other community members.

Dawid Kruk

I tested this with a PodDisruptionBudget: the pods will still restart at the same time.

Example: https://github.com/AlphaWong/PodDisruptionBudgetAndPodProbe
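
The repository above contains the full manifests; a minimal PodDisruptionBudget of that general shape looks like this (the name and selector are placeholders, not copied from the repo):

apiVersion: policy/v1   # policy/v1 is available from Kubernetes 1.21; older clusters use policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  # at least one pod matching the selector must stay available during voluntary disruptions
  minAvailable: 1
  selector:
    matchLabels:
      app: example

Even with such a budget in place, the kubelet restarting a container on a liveness probe failure is not an eviction, so the budget does not apply.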

So yes, as @Dawid Kruk said, you should create a customized probe script like the following:

# something like this
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # sleep for a random time, then check the external dependency (the nginx2 service) with a 5-second timeout
    - 'SLEEP_TIME=$(shuf -i 2-40 -n 1); sleep $SLEEP_TIME; curl -L --max-time 5 -f nginx2.default.svc.cluster.local'
  initialDelaySeconds: 10
  # think about the gap between each call
  periodSeconds: 30
  # allow enough time for the sleep plus the curl call; newer Kubernetes versions enforce exec probe timeouts
  timeoutSeconds: 90
Alpha

I'm wondering if a pod disruption budget would prevent this scenario.

Yes, it will prevent it.

As you stated, when a pod goes down (or a node fails), nothing can stop pods from becoming unavailable. However, certain services require that a minimum number of pods always keeps running.

There could be another way (a StatefulSet resource), but a PodDisruptionBudget is one of the simplest Kubernetes resources available.

Note: You can also use a percentage instead of an absolute number in the minAvailable field. For example, you could state that 60% of all pods with the app=run-always label need to be running at all times.
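
For that percentage-based example, the budget would look roughly like this (the resource name is arbitrary):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: run-always-pdb
spec:
  # 60% of the pods carrying this label must remain available during voluntary disruptions
  minAvailable: "60%"
  selector:
    matchLabels:
      app: run-always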

Gupta
  • OP asked about voluntary disruptions. From their question it looks like they know about PDB already. – zerkms Apr 27 '21 at 08:38
  • Yes, they know, but as per their requirements this is the minimum effort to achieve it. I edited my answer. Let them come back and share their opinion. Also, this is something that needs more research than a straight answer. Agreed? – Gupta Apr 27 '21 at 08:44
  • "Yes, it will prevent." --- are you sure PDB prevents liveness probes from restarting the pod? – zerkms Apr 27 '21 at 08:46
  • @Gupta I need to agree with you on your last comment. Could you please edit your answer to support the last comment you've made (with `loose-loose` situation) so that it would be more visible and explain the circumstances more clearly? – Dawid Kruk Apr 27 '21 at 13:19
  • Edit comments - _I did not mean that. This is something that needs to be handled at the application level. On one side the PDB tries to keep pods available as per its nature of implementation, and on the other hand the liveness probe will try to restart the pod in order to make it healthy._ – Gupta Apr 27 '21 at 13:37
  • @Gupta do you have any references or tests that prove this? Otherwise, I might have to run my own in the near future – roim Apr 27 '21 at 21:35
  • For reference, I would suggest looking at the Kubernetes in Action book https://www.manning.com/books/kubernetes-in-action?query=kuber chapter 15 – Gupta Apr 28 '21 at 03:47