
I am running OCP 4.6 with RHEL 7.8 bare-metal compute nodes. We are running functionality and HA testing on the cluster. Our main application on this cluster is a StatefulSet with around 250 pods.

After shutting down a node, the pods that were running on it entered the Terminating state and are stuck there. Since this is a StatefulSet, a pod cannot be rescheduled on another node until the original pod finishes terminating.

I can delete the pods with `--force --grace-period=0`, but this does not solve the underlying issue.
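For reference, this is roughly the command I used (the pod name is illustrative):

```
# Removes the pod object from the API server immediately, without waiting
# for the kubelet to confirm termination -- risky for StatefulSets:
kubectl delete pod my-app-0 --force --grace-period=0
```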

These pods only terminate after the server that was shut down returns to Ready status.

Any ideas??

UPDATE:

Looking at the Kubernetes docs, I found that a StatefulSet pod not terminating after its node shuts down is actually a safety mechanism, and in fact a feature: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/

Linux Devops
  • Can you check the logs of those pods while they are terminating? Also, what are the events when you try to `kubectl describe` them? – Wytrzymały Wiktor Jan 21 '21 at 12:28
  • You may find these useful: https://stackoverflow.com/questions/50581744/a-solution-to-kubernetes-pods-stuck-on-terminating https://stackoverflow.com/questions/64029871/what-would-happen-if-i-restart-a-node-with-some-pods-running – VAS Feb 02 '21 at 21:42

2 Answers


If you want to avoid Pods getting stuck when you shut down a Node, you should Safely Drain the Node first:

You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). Safe evictions allow the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified.

When kubectl drain returns successfully, that indicates that all of the pods have been safely evicted (respecting the desired graceful termination period, and respecting the PodDisruptionBudget you have defined). It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
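A minimal drain-and-shutdown sequence might look like this (the node name is illustrative):

```
# Evict all pods from the node, respecting PodDisruptionBudgets.
# --ignore-daemonsets is required because DaemonSet pods cannot be evicted;
# on older clients, --delete-emptydir-data is spelled --delete-local-data.
kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data

# Once drain returns successfully, it is safe to power the machine down.
# After maintenance, make the node schedulable again:
kubectl uncordon worker-node-01
```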

Also note the recommended courses of action in case of stuck evictions:

  • Abort or pause the automated operation. Investigate the reason for the stuck application, and restart the automation.

  • After a suitably long wait, DELETE the Pod from your cluster's control plane, instead of using the eviction API.

Kubernetes does not specify what the behavior should be in this case; it is up to the application owners and cluster owners to establish an agreement on behavior in these cases.
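As a sketch of the distinction: `kubectl drain` goes through the Eviction API, which respects PodDisruptionBudgets and can therefore block indefinitely, while a plain DELETE against the control plane does not (pod and namespace names are illustrative):

```
# A plain DELETE of the pod object; it does not consult
# PodDisruptionBudgets, so it will not get stuck on them:
kubectl delete pod my-app-0 --namespace my-namespace
```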

In order to investigate the stuck Pods you can:

  • Check the Pod's logs with `kubectl logs ${POD_NAME}`

  • Run `kubectl describe pod ${POD_NAME}` and check its Events section

  • Debug by exec-ing into a running container with `kubectl exec`
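Put together, a quick triage session might look like this (the pod name is illustrative):

```
# Check the container logs; --previous shows the last terminated container:
kubectl logs my-app-0
kubectl logs my-app-0 --previous

# Inspect the Events section at the bottom of the output:
kubectl describe pod my-app-0

# Open a shell inside a still-running container:
kubectl exec -it my-app-0 -- /bin/sh
```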

More details can be found in the linked docs.

Wytrzymały Wiktor
  • This is the fix, but I was looking for more of a reason. I managed to find an explanation for this issue in the K8s docs, as linked in the update above. Thanks for all the information you supplied! – Linux Devops Mar 03 '21 at 07:49

Maybe you can check whether your pod defines a "finalizer". Sometimes a pod will not terminate because it is waiting for the finalizer action to finish, but for whatever reason the finalizer cannot run.

If so, you can try to edit the pod and remove the finalizers section to see if your pod really goes away.

Of course, doing so may leave your apps in a bad state, depending on what the finalizer was supposed to do.
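A minimal sketch of checking for and clearing finalizers (the pod name is illustrative, and clearing finalizers skips whatever cleanup they were meant to perform):

```
# Show any finalizers set on the pod:
kubectl get pod my-app-0 -o jsonpath='{.metadata.finalizers}'

# Clear the finalizers so the deletion can complete -- only do this once
# you understand what the finalizer was guarding:
kubectl patch pod my-app-0 -p '{"metadata":{"finalizers":null}}'
```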


titou10