
We have an autopilot cluster in GKE. Sometimes, our pods simply get terminated, with no explanation. We suspect that k8s is preempting our pods - we only have one DAG running on a daily schedule in this cluster, but it tries to run a number of tasks simultaneously and we think that if there aren't enough resources, k8s preempts an existing pod to start another.

Is there a way to test for this? Is there a way to configure GKE/k8s to be a little more patient when waiting for resources?


2 Answers


If resources are not requested, or the specified values fall outside the allowed ranges, GKE Autopilot modifies the requests to bring them within the allowed limits; otherwise Autopilot does not schedule the pods.

In your case, Autopilot might have modified the pod's resources to match the minimum resource limits. It is therefore recommended to specify the required resources explicitly in your workload manifests. To avoid these issues you may also want to consider Horizontal Pod Autoscaling (HPA) in GKE Autopilot.
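
For illustration, here is a minimal sketch of what explicit requests could look like, assuming the tasks run under Airflow's KubernetesExecutor (the pod_override pattern); if you use the KubernetesPodOperator instead, the equivalent is its container resources argument. The task name and the CPU/memory values are placeholders, not recommendations:

    from kubernetes.client import models as k8s
    from airflow.decorators import task

    # Explicit requests stop Autopilot from substituting its own defaults
    # and make scheduling (and preemption) behaviour more predictable.
    resource_override = {
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",  # the task container created by the KubernetesExecutor
                        resources=k8s.V1ResourceRequirements(
                            requests={"cpu": "500m", "memory": "1Gi"},
                            limits={"cpu": "1", "memory": "2Gi"},
                        ),
                    )
                ]
            )
        )
    }

    @task(executor_config=resource_override)
    def heavy_task():
        ...  # the actual work goes here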

Refer to this document for more detailed information about setting resource limits in Autopilot.

Refer to this document for detailed information about automatic resource management in GKE Autopilot.


After some discussion within the team, and also with a Google support engineer, we added some "warm-up" tasks to our DAG. These are simple Python tasks that just sleep for a while (6 minutes seems to be just enough) so that the cluster has time to scale up and start the pods it needs. If it needs to preempt something, it preempts a warm-up task, and that's OK.
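
For reference, a minimal sketch of what one of these warm-up tasks could look like, assuming the TaskFlow API; the task name is illustrative and the 6-minute figure is just what happened to work for us:

    import time
    from airflow.decorators import task

    @task
    def warm_up(minutes: int = 6):
        # Idle long enough for Autopilot to provision extra capacity; if the
        # cluster needs to preempt something while it scales up, it takes
        # this task rather than a real one.
        time.sleep(minutes * 60)

How many warm-up tasks to run, and whether they go upstream of or alongside the real tasks, is a judgment call that depends on where the preemptions happen.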

Since implementing this, we haven't had any real tasks get preempted.
