
I have a 3-node cluster running on GKE. All the nodes are preemptible, meaning they can be killed at any time and generally do not live longer than 24 hours. When a node is killed, the autoscaler spins up a new node to replace it, which usually takes a minute or so.

In my cluster I have a deployment with its replicas set to 3. My intention is that the pods will be spread across the nodes so that my application keeps running as long as at least one node in the cluster is alive.
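For context, a minimal sketch of the deployment shape I'm describing (the name and container image are just placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest  # placeholder image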

I've used the following affinity configuration so that pods prefer to run on hosts that are not already running a pod from the same deployment:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - my-app
          topologyKey: kubernetes.io/hostname
        weight: 100

When I scale my application up from 0, this seems to work as intended. But in practice the following happens:

  1. Let's say pods A, B and C belonging to the my-app ReplicaSet are running on nodes 1, 2 and 3 respectively, so the state is:
     1 -> A
     2 -> B
     3 -> C
  2. Node 3 is killed, taking pod C with it and leaving 2 running pods in the ReplicaSet.
  3. The scheduler automatically starts scheduling a new pod to bring the ReplicaSet back up to 3 replicas.
  4. It looks for a node without any my-app pods. As the autoscaler is still in the process of starting the replacement node (4), only nodes 1 and 2 are available.
  5. It schedules the new pod D on node 1.
  6. Node 4 eventually comes online, but as my-app already has all its pods scheduled, none of them run on it. The resulting state is:
     1 -> A, D
     2 -> B
     4 -> -

This is not the ideal distribution. The problem arises because there is a delay before the replacement node is created, and the scheduler is not aware that it will become available very soon.

Is there a better configuration that can ensure the pods are always distributed across the nodes? I was thinking a directive like preferredDuringSchedulingPreferredDuringExecution might do it, but that doesn't exist.

harryg

1 Answer


preferredDuringSchedulingIgnoredDuringExecution means it is a preference, not a hard requirement, which could explain the 1 -> A, D placement.

I believe you are looking for requiredDuringSchedulingIgnoredDuringExecution in conjunction with pod anti-affinity, so that the workload is distributed across nodes.
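A minimal sketch of what that could look like for your deployment, reusing the app: my-app label and topologyKey from your question:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-app
        topologyKey: kubernetes.io/hostname

Note that the required variant takes the affinity terms directly (there is no weight), and a pod that cannot satisfy the rule stays Pending until a suitable node exists.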

Please have a look at this GitHub page for more details and examples.

dany L
  • This could work. Am I right in thinking that in the case described above the pod will be in `pending` state until a new node becomes available? Also, what if I wanted more than one pod per node. Wouldn't this rule prevent this? – harryg Nov 05 '19 at 16:33
  • This only works in a 1:1 strategy. A different approach will need to be used if you want more than one pod of the same app on a node (one possibility is sketched after these comments). – dany L Nov 05 '19 at 16:41
  • Indeed @danyL - care to expand on this? – mirekphd Nov 19 '21 at 19:29
  • Look into those terms: pod affinity attracts a pod to certain pods, node affinity attracts a pod to certain nodes, and pod anti-affinity repels a pod from certain pods. There is also node anti-affinity. You will need to research the best combination that suits your needs. – dany L Nov 20 '21 at 15:05
  • Thanks a lot @danyL for the `pod[Anti]Affinity` suggestion (I have converged on the same two-affinities solution in my own question here: https://stackoverflow.com/a/70041931/9962007). I think I will also have to switch to `StatefulSets` to ensure ordered startup of "masters" before "slaves" that have `nodeAffinity` pinned to those "masters" location. – mirekphd Nov 21 '21 at 13:59
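One concrete example of the "different approach" danyL mentions for running more than one pod of the same app per node (a sketch only, not what the answer above uses) is a topologySpreadConstraints block in the pod template, which limits the skew between nodes rather than forbidding co-location outright:

spec:
  topologySpreadConstraints:
  - maxSkew: 1                          # allow at most one extra pod per node
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft preference; DoNotSchedule makes it a hard rule
    labelSelector:
      matchLabels:
        app: my-app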