
I need to scale a set of pods that run queue-based workers. Jobs for workers can run for a long time (hours) and should not get interrupted. The number of pods is based on the length of the worker queue. Scaling would be done either with the Horizontal Pod Autoscaler using custom metrics, or with a simple controller that changes the number of replicas.

The problem with either solution is that, when scaling down, there is no control over which pod(s) get terminated. At any given time, most workers are likely working on short running jobs or idle, with a few (more rarely) processing a long running job. I'd like to avoid killing the workers with long running jobs; idle workers or those on short running jobs can be terminated without issue.

What would be a way to do this with low complexity? One thing I can think of is to do this based on CPU usage of the pods. Not ideal, but it could be good enough. Another method could be that workers somehow expose a priority indicating whether they are the preferred pod to be deleted. This priority could change every time a worker picks up a new job though.

Eventually all jobs will be short running and this problem will go away, but that is a longer term goal for now.

Stragulus
  • See possibly-related question: https://stackoverflow.com/questions/60924076/batch-processing-on-kubernetes – Stephen Nov 18 '20 at 21:01

6 Answers

4

Since Kubernetes 1.22 there is a beta feature that helps you do this. You can add the annotation controller.kubernetes.io/pod-deletion-cost with a value in the range [-2147483647, 2147483647], and pods with a lower value will be killed first when scaling down. The default is 0, so setting a negative value on one pod makes it the first candidate for removal, e.g.

kubectl annotate pods my-pod-12345678-abcde controller.kubernetes.io/pod-deletion-cost=-1000

Link to discussion about the implementation of this feature: Scale down a deployment by removing specific pods (PodDeletionCost) #2255

Link to the documentation: ReplicaSet / Pod deletion cost
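
In practice, a worker can raise its own deletion cost when it picks up a long job and drop it again when the job finishes. A minimal sketch, assuming kubectl is available in the worker image, POD_NAME is injected via the downward API, and the pod's ServiceAccount is allowed to patch pods:

# Mark this worker as expensive to delete while a long job is running.
# Assumes: kubectl in the image, POD_NAME from the downward API, and RBAC
# that lets this ServiceAccount patch pods in its namespace.
kubectl annotate pod "$POD_NAME" \
  controller.kubernetes.io/pod-deletion-cost=10000 --overwrite

# ... long-running job executes here ...

# Remove the annotation (back to the default cost of 0) once the job is done,
# so the pod becomes a normal scale-down candidate again.
kubectl annotate pod "$POD_NAME" controller.kubernetes.io/pod-deletion-cost-

The cost is only a relative preference among pods of the same ReplicaSet, and the documentation advises against updating it frequently, so updating it once per job pickup and completion is about the right granularity.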

gepa
2

During the termination of a pod, Kubernetes sends a SIGTERM signal to your pod's container. You can use that signal to gracefully shut down your app. The problem is that Kubernetes does not wait forever for your application to finish, and in your case your app may take a long time to exit.
In this case I recommend you use a preStop hook, which runs and must complete before Kubernetes sends the TERM signal to the container. Here is an example of how to use handlers:

apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
      preStop:
        exec:
          command: ["/bin/sh","-c","nginx -s quit; while killall -0 nginx; do sleep 1; done"]
victortv
    This would not work for the envisioned solution as the long running jobs cannot resume from saved state; they'd have to start all over again. – Stragulus Apr 24 '19 at 19:38
  • You don't necessarily need to save the state in the preStop command; you can actually do anything you want in a custom script inside the container, e.g. `command: ["/bin/sh","/myscript.sh"]`. Inside this script you can check whether the worker is idle or busy; if it is busy, wait some time and check its status again. Once the worker is idle, the script finishes and Kubernetes kills the pod. Please correct me if I didn't understand correctly what you meant. – victortv Apr 24 '19 at 20:23
  • I see two issues with this approach: k8s docs say: "Users should make their hook handlers as lightweight as possible. There are cases, however, when long running commands make sense, such as when saving state prior to stopping a Container." - this could block for a long time (hours). Secondly, it also will give up on the hook after a given grace period. While that number could be set really high, it again seems like a bad practice. Do you have experience using this strategy and does it work well? – Stragulus Apr 24 '19 at 21:38
1

There is a kind of workaround that can give some control over pod termination. I am not quite sure if it is best practice, but at least you can try it and test whether it suits your app.

  1. Increase the Deployment grace period with terminationGracePeriodSeconds: 3600, where 3600 is the time in seconds of the longest possible task in the app (one way to set this is sketched at the end of this answer). This makes sure that the pods will not be force-killed before the longest task has had time to finish. Read the docs about the pod termination process in detail.
  2. Define a preStop handler. More details about lifecycle hooks can be found in docs as well as in the example. In my case, I've used the script below to create the file which will later be used as a trigger to terminate the pod (probably there are more elegant solutions).
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "touch /home/node/app/preStop"]
    
    
  3. Stop your app as soon as the condition is met. When the app exits, the pod terminates as well. It is not possible to end the PID 1 process from the preStop shell script, so you need to add some logic to the app to terminate itself. In my case it is a NodeJS app with a scheduler that runs every 30 seconds and checks whether two conditions are met: !isNodeBusy identifies whether the app is allowed to finish, and fs.existsSync('/home/node/app/preStop') whether the preStop hook was triggered. The logic might be different for your app, but you get the basic idea.
    // Assumes the node-schedule package; isNodeBusy is a flag maintained by the app.
    const schedule = require('node-schedule');
    const fs = require('fs');

    schedule.scheduleJob('*/30 * * * * *', () => {
      if (!isNodeBusy && fs.existsSync('/home/node/app/preStop')) {
        process.exit();
      }
    });
    

Keep in mind that this workaround only works for voluntary disruptions and obviously does not help with involuntary disruptions. More info in the docs.
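
For step 1, the grace period is a field of the pod spec inside the Deployment template. A quick sketch of setting it with a patch, where worker is a placeholder Deployment name:

# Raise the termination grace period to one hour on an existing Deployment.
# The same field can also be set directly in the manifest under
# spec.template.spec.terminationGracePeriodSeconds.
kubectl patch deployment worker \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":3600}}}}'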

Juniper
1

I think running this type of workload using a Deployment or similar, and using a HorizontalPodAutoscaler for scaling, is the wrong way to go. One way you could go about this is to:

  1. Define a controller (this could perhaps be a Deployment) whose task is to periodically create a Kubernetes Job object.
  2. The spec of the Job should contain a value for .spec.parallelism equal to the maximum number of concurrent executions you will accept.
  3. The Pods spawned by the Job then run your processing logic. They should each pull a message from the queue, process it, and then delete it from the queue (in the case of success).
  4. The Pods must exit with the correct status (success or failure). This ensures that the Job recognises when the processing has completed, and so will not spin up additional Pods.

Using this method, .spec.parallelism controls the autoscaling based on how much work there is to be done, and scale-down is an automatic benefit of using a Job.
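
As a rough illustration of points 2–4, a work-queue style Job could look like the sketch below. The names and image are placeholders, and .spec.completions is deliberately left unset so that the Job completes once the pods have drained the queue and exited successfully:

# Sketch of a work-queue Job capped at 5 concurrent workers.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-worker
spec:
  parallelism: 5            # maximum number of worker pods running at once
  # completions is left unset: each pod pulls messages until the queue is
  # empty and then exits successfully, which completes the Job.
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/queue-worker:latest   # placeholder image
EOF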

Matt Dunn
  • Sure, this causes scaling, but how does it prevent cluster autoscaling from killing off pods in one node to move them to another when it decides that the cluster is underutilized? – Stephen Feb 09 '22 at 15:12
  • @Stephen Apologies, the original question was asking about horizontal _pod_ autoscaling, not cluster autoscaling. The behaviour of cluster autoscaling is vendor-specific, but for example, on GKE a node can be prevented from deletion (and its pods prevented from eviction) if a pod's affinity/anti-affinity rules prevent rescheduling, amongst other things. I think this topic warrants a new question, with a bit more detail about your specific problem. – Matt Dunn Feb 09 '22 at 17:01
  • @Stephen And it's also worth pointing out that you can do all sorts of tricks with limits, requests, hooks and priorities, as others have mentioned. But at the end of the day, if a node is under heavy pressure (i.e. high memory/CPU load), then all things being equal, there is always the possibility that your critical pod might be chosen for eviction. So it's best to always code and plan for that eventuality in your app. – Matt Dunn Feb 09 '22 at 17:07
0

You are looking for Pod Priority and Preemption. By configuring a high priority PriorityClass for your pods you can ensure that they won't be removed to make space for other pods with a lower priority.

  1. Create a new PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
  2. Set your new PriorityClass in the spec of your pods
priorityClassName: high-priority

The value: 1000000 in the PriorityClass configures the scheduling priority of the pod. The higher the value the more important the pod is.

Lukas Eichler
  • But what if you want to just make sure that the pod isn't killed because the system determines the cluster is underutilized and decides to reduce the number of nodes by moving pods elsewhere? – Stephen Feb 09 '22 at 15:08
  • @Stephen Here you could apply a PodDisruptionBudget that sets the minimum available pods to the total number of pods. With this in place the cluster won't take any actions to move these pods. – Lukas Eichler Feb 09 '22 at 18:04
0

For those who land on this page facing the issue of Pods getting killed while a Node is scaling down -

This is expected behaviour of the Cluster Autoscaler, as the CA tries to pack pods so that it can run the cluster at the minimum size. However, you can protect your Job pods from eviction (getting killed) by creating a PodDisruptionBudget with maxUnavailable=0 for them.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sample-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: <your_app_name>
Arjit Sharma
  • My read of the docs is that this is best-effort-only, meaning the PodDisruptionBudget will be respected if other options are available for pod scheduling, but if your long running pod is the only option for a pod that it's trying to schedule, it will violate your PodDisruptionBudget. See [ref](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#poddisruptionbudget-is-supported-but-not-guaranteed) – Andrew Schwartz Feb 03 '23 at 15:57