
Our group has recently set up a 3-node Kubernetes cluster, and we've been using Jobs to schedule batch processing tasks on it. We have a lot of work to do and not a particularly large cluster to do it on, so at any given time there are a bunch of "pending" pods waiting to run on the cluster.

These pods have different resource requests; some are much larger than others. For example, some pods need 4 GB RAM and some need 100 GB RAM.
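For illustration, a simplified pod template for one of the small Jobs looks roughly like this (the name and image are placeholders; the large Jobs request "100Gi" instead of "4Gi"):

apiVersion: batch/v1
kind: Job
metadata:
  name: small-batch-job                 # placeholder name
spec:
  template:
    spec:
      containers:
      - name: worker
        image: example.com/batch-worker:latest   # placeholder image
        resources:
          requests:
            memory: "4Gi"               # the large Jobs request "100Gi" here
      restartPolicy: Never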

The problem we are having is that our large pods are never actually being run as long as there are enough small pods available to keep the cluster busy. As soon as one 4 GB pod finishes, Kubernetes looks and sees that a 4 GB pod will fit while a 100 GB pod won't, and it schedules a new 4 GB pod. It doesn't seem to ever decide that a 100 GB pod has been waiting long enough and refrain from scheduling new pods on a particular node until enough have finished that the 100 GB pod will fit there. Perhaps it can't tell that our pods come from jobs and are expected to eventually finish, unlike, say, a web server.

How can Kubernetes be configured to ensure that small pods cannot starve big pods indefinitely? Is there some kind of third-party scheduler with this behavior that we need to add to our installation? Or is there some way to configure the default scheduler to avoid this behavior?

interfect
  • What version of k8s? – coderanger Dec 11 '19 at 19:12
  • @coderanger Our cluster is running Kubernetes v1.15.3. – interfect Dec 11 '19 at 19:29
  • I don't think there is a great way to handle this. "Time pending" does not appear to be a weighting factor for the scheduler. I can't say this for certain since my knowledge of scheduler internals is sketchy, but I think you may need to write your own scheduler to handle this. – coderanger Dec 12 '19 at 05:35
  • As @coderanger mentioned you can create your own scheduler. You may also take a look at https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ and give higher priority to the pods with higher resource needs (see the sketch after these comments). – kool Dec 12 '19 at 12:16
  • We're having the same problem in our group. How did you end up solving this problem @interfect? – Vaderico Dec 12 '22 at 05:30
  • @Vaderico So far we really haven't; we just stopped running so many of the workloads where it was really apparent. I think we also relegated the small-jobs workloads to a fraction of the machines by restricting their namespace. I looked at the YuniKorn scheduler, which can do some slightly fancier multi-tenant queuing, but even that didn't have a real OS-thread-style fair scheduler (I think it used to and they turned it off). – interfect Dec 13 '22 at 20:37
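As a minimal sketch of the pod-priority approach kool suggests (the class name and priority value below are hypothetical, not from the discussion above):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: large-batch            # hypothetical name
value: 1000000                 # higher value = higher scheduling priority
globalDefault: false
description: "Priority for the large (100 GB) batch Jobs"

The large Jobs' pod templates would then reference it:

spec:
  template:
    spec:
      priorityClassName: large-batch

Note that with preemption enabled (the default), higher-priority pending pods may evict lower-priority running pods rather than just moving ahead in the scheduling queue.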

1 Answer


You can use node affinity. Set a label on your node pools, e.g. example.com/nodeSize=4 for the small nodes and example.com/nodeSize=8 for the big nodes.
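For example, assuming node names like node-small-1 and node-big-1 (placeholders), the labels can be applied with kubectl:

kubectl label nodes node-small-1 example.com/nodeSize=4
kubectl label nodes node-big-1 example.com/nodeSize=8

Then add preferred scheduling rules like the following to the pod spec: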

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: "example.com/nodeSize"
            operator: In
            values:
            - "4"
      - weight: 40
        preference:
          matchExpressions:
          - key: "example.com/nodeSize"
            operator: In
            values:
            - "8"
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: "kubernetes.azure.com/mode"
              operator: In
              values:
              - system
          topologyKey: kubernetes.io/hostname
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: "example.com/nodeSize"
              operator: In
              values:
              - "8"
          topologyKey: kubernetes.io/hostname
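As a complementary sketch along the same lines (not part of the original answer), the large pods can also be given a hard requirement for the big nodes, so that together with the preferences above the big nodes are left free for them:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "example.com/nodeSize"
            operator: In
            values:
            - "8"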
AmanicA