
Our GKE Autopilot cluster was recently upgraded to version 1.21.6-gke.1503, which apparently causes the `cluster-autoscaler.kubernetes.io/safe-to-evict=false` annotation to be banned.

I totally get this for deployments, as Google doesn't want a deployment preventing scale-down, but for jobs I'd argue this annotation makes perfect sense in certain cases. We run complex jobs that themselves start and monitor other jobs, which makes it hard to make them restart-resistant given the sheer number of moving parts.

Is there any way to make it as unlikely as possible for job pods to be restarted/moved around when using Autopilot? Prior to switching to Autopilot, we made sure our jobs filled a single node by requesting all of its available resources; combined with the Guaranteed QoS class, this ensured the only way for a pod to be evicted was if the node somehow failed, which almost never happened. Now all we seem to have left is the Guaranteed QoS class, but that doesn't prevent pods from being evicted.
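
For illustration, a minimal sketch of the pre-Autopilot setup described above: a Job whose pod requests a whole node's worth of resources, with requests equal to limits so the pod gets the Guaranteed QoS class. The job name, image, and resource values are placeholders, not our actual configuration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: parent-job  # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example.com/worker:latest  # placeholder image
          resources:
            # requests == limits puts the pod in the Guaranteed QoS class;
            # values are placeholders sized to fill a node's allocatable capacity
            requests:
              cpu: "7"
              memory: 26Gi
            limits:
              cpu: "7"
              memory: 26Gi
```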

PLPeeters
  • Please update your question with some examples and/or reproduction steps; maybe YAML files etc. The more details the better. – Wojtek_B Mar 17 '22 at 13:45
  • @Wojtek_B It's hard to provide examples for a case like this, as the jobs are pretty complex and the issue seems unpredictable. It does seem to happen more often when the cluster is running many (parent) jobs in parallel, so it could be related to resources becoming available when some pods terminate. So let's look at this from a theoretical standpoint: suppose we have jobs A and B on node 1 and job C on node 2, each consuming 50% of a node's resources. If job B terminates, is there any point at which the scheduler will move job C to node 1 so it can shut down node 2 and save resources? – PLPeeters Mar 18 '22 at 14:28
  • The above scenario is possible. Have you looked into using PDBs? – Gari Singh Mar 18 '22 at 23:14
  • @GariSingh Nope. I'm guessing creating a PDB with `maxUnavailable: 0` matching, say, a `prevent-eviction` label and setting that label on my job pods would work? But then a) how is this different from the `safe-to-evict` annotation, and why not allow it in Autopilot in the first place? And b) what guarantees do I have that this won't end up being banned in Autopilot as well? – PLPeeters Mar 21 '22 at 09:58
  • @GariSingh After investigating PDBs it turns out they don't work with jobs, so unless I missed something that won't work. – PLPeeters Mar 21 '22 at 10:37
  • Did you try adding [`nodeSelector` field](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#step-two-add-a-nodeselector-field-to-your-pod-configuration) ? – Wojtek_B Mar 21 '22 at 10:47
  • @Wojtek_B Unless I'm missing something, I don't see how having a node selector would help us avoid evictions, since AFAIK node selectors only tell the scheduler which nodes we want or don't want a pod to run on? – PLPeeters Mar 21 '22 at 12:30
  • @PLPeeters - I think you can use PDBs with Jobs but you can only use `minAvailable` (see the sketch after these comments) – Gari Singh Mar 22 '22 at 08:37
  • @GariSingh `minAvailable` doesn't work for our use case. Our jobs are spawned on-demand with unique names, meaning this would require a different PDB for each job (otherwise the PDB would span across multiple jobs, rendering it useless). We really need a solution for this as right now running multiple jobs in parallel leads to mayhem due to pods being moved around. We'd really like to avoid having to abandon Autopilot because of this. – PLPeeters Mar 22 '22 at 10:55
  • I'd hate to see you abandon Autopilot as well. Working with others on the team to explore other options. – Gari Singh Mar 22 '22 at 18:25
  • Is there a maximum amount of time the Job needs to run for? For example, if you could set a large enough `terminationGracePeriodSeconds` would that suffice? – William Denniss Mar 22 '22 at 18:25
  • @WilliamDenniss Depending on the job it can range anywhere from 10 minutes to multiple days. Apparently there's no max value in Kubernetes (would have to check for GKE), but then if I set `terminationGracePeriodSeconds` to the maximum possible value, it's basically the same as using the `safe-to-evict` annotation, so why not just allow that one in the first place? And since it would be virtually the same resource-hogging wise as the annotation, I'm concerned about a future ban/cap on that as well since it would lead to the same problem. – PLPeeters Mar 23 '22 at 08:34
  • In addition, when people search for a solution to avoid pod eviction, the `safe-to-evict` annotation is the answer they'll find most often, so in that regard it also makes sense for Autopilot to support it IMO, since its aim seems to be making Kubernetes use as frictionless as possible. – PLPeeters Mar 23 '22 at 08:35
  • @WilliamDenniss Unfortunately after some tests `terminationGracePeriodSeconds` is not a viable solution either as the autoscaler sets a 10-minute override when scaling down a node. I had a pod with a 7-day grace period and when it got terminated for node scale-down the grace period was set to 600 seconds. – PLPeeters Mar 25 '22 at 09:47
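
A rough sketch of the PDB approach discussed in the comments above; the label, names, and values are hypothetical. As noted, only `minAvailable` works for Job pods, and since the jobs here have unique names, each job would need its own PDB whose selector matches only that job's pods, which is what makes this impractical for this use case.

```yaml
# Sketch only: one PDB per uniquely-named job would be required.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: job-a-pdb  # hypothetical name, tied to a single job
spec:
  minAvailable: 1  # maxUnavailable is not supported for Job pods
  selector:
    matchLabels:
      prevent-eviction: "job-a"  # hypothetical label set on that job's pod template
```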

2 Answers


At this point the only thing left is to ask for this feature to be brought back on Google's IssueTracker: raise a new feature request and hope for the best.

Link to this thread as well, as it contains quite a lot of troubleshooting detail and may be useful.

Wojtek_B
  • [Issue created](https://issuetracker.google.com/issues/227162588) as a bug, since this worked before and no longer does. – PLPeeters Mar 28 '22 at 15:06

This is now supported in GKE Autopilot as of version 1.27.

Setting `cluster-autoscaler.kubernetes.io/safe-to-evict=false` will prevent GKE-initiated disruption to the Pod for 7 days (including autoscaling-related and update-related disruption).
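
For completeness, a sketch of what this looks like on a Job's pod template once on 1.27+; the job name and image are placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: parent-job  # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Per this answer: on Autopilot 1.27+, blocks GKE-initiated
        # disruption to the pod for 7 days
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example.com/worker:latest  # placeholder image
```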

William Denniss