2

Goal: Have one pod (namely 'log-scraper') scheduled on every node, at least once but no more than once.

Assume a cluster has the following nodes

Nodes

  1. master/control-plane
  2. worker-1
  3. worker-2
  4. worker-3

Pod I'm working with

apiVersion: v1
kind: Pod
metadata:
  name: log-scraper
spec:
  volumes:
  - name: container-log-dir
    hostPath:
      path: /var/log/containers
  containers:
    - image: "logScraper:latest"
      name: log-munger
      volumeMounts:
      - name: container-log-dir
        mountPath: /var/log/logging-app

Adding affinity to select only 'worker' nodes (or non-master nodes):

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: "worker"
              operator: In
              values:
              - "true"

Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper?

Question 2: What other manifests should be applied/added to achieve this?

hagrawal7777
JumbledCode

3 Answers

9

You should probably use a DaemonSet, which is made exactly for this purpose: it schedules one pod per node, and pods are automatically added to new nodes when the cluster autoscaler adds them.
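
A minimal sketch of what the Pod above could look like as a DaemonSet (the app: log-scraper label is an assumption for the selector; the node affinity block reuses the worker=true node label from the question):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-scraper
spec:
  selector:
    matchLabels:
      app: log-scraper            # must match the pod template labels below
  template:
    metadata:
      labels:
        app: log-scraper
    spec:
      affinity:
        nodeAffinity:             # same node selection as in the question: only nodes labelled worker=true
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "worker"
                operator: In
                values:
                - "true"
      volumes:
      - name: container-log-dir
        hostPath:
          path: /var/log/containers
      containers:
      - image: "logScraper:latest"
        name: log-munger
        volumeMounts:
        - name: container-log-dir
          mountPath: /var/log/logging-app

The DaemonSet controller itself guarantees exactly one such pod per matching node, so no anti-affinity rules or extra manifests are needed.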

Krishna Chaurasia
  • It looks like you are trying to write a custom logging system with the above concept of affinity, and DaemonSets are the best fit for a logging use case. See the [example](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#create-a-daemonset) of configuring fluentd as a DaemonSet to provide logging for the cluster. – Krishna Chaurasia Feb 28 '21 at 14:31
  • Exactly. Use the right abstraction (DaemonSet in this case) for the right problem and don't try to just hack something together using lower-level primitives (i.e. pods). – donhector Jan 05 '23 at 09:38
3

Concept

There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".

  • Affinity will basically select nodes (or pods) based on the given criteria, while anti-affinity will avoid them based on the given criteria.
  • With Affinity and Anti-affinity, you can use operators like In, NotIn, Exists, DoesNotExist, Gt and Lt. When you use NotIn or DoesNotExist, it becomes anti-affinity.

Now, in Affinity/Antiaffinity, you have 2 choices - Node affinity/antiaffinity and Inter-pod affinity/antiaffinity

Node affinity/antiaffinity

Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
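
For example, using the DoesNotExist operator, node affinity behaves like node-level anti-affinity and can keep pods off the control-plane node. The node-role.kubernetes.io/control-plane label below is the conventional control-plane label on recent clusters; check which label your master node actually carries:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # DoesNotExist / NotIn turn node affinity into "anti-affinity":
        # only schedule on nodes WITHOUT the control-plane role label
        - key: node-role.kubernetes.io/control-plane
          operator: DoesNotExist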


Inter-pod affinity/antiaffinity

Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on, based on labels on pods that are already running on the node, rather than based on labels on nodes.


Your Solution

Basically what you need is "anti-affinity", and specifically "pod anti-affinity" rather than node anti-affinity. So, your solution should look something like below (please note that since I do not have a 3-node cluster I couldn't test this, so there is a small chance you might have to make minor code adjustments):
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      # each term needs a labelSelector (matching POD labels, not node labels)
      # and a topologyKey; kubernetes.io/hostname effectively means "per node"
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - log-scraper
        topologyKey: kubernetes.io/hostname
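
For the selector above to match anything, the log-scraper pods themselves have to carry that label; app: log-scraper is just a key/value chosen here to match the snippet above:

metadata:
  name: log-scraper
  labels:
    app: log-scraper   # must match the podAntiAffinity labelSelector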

Read more over here, and especially go through the example over here.

hagrawal7777
1

Using Pod Topology Spread Constraints

Another way to do it is using Pod Topology Spread Constraints.

You will set up taints and tolerations as usual to control which nodes the pods can be scheduled on. Then add some labels to the pod; I will use the pod label id: foo-bar in the example. Then, to allow only a single pod from a ReplicaSet, Deployment or other controller to be scheduled per node, add the following to the pod spec.

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        id: foo-bar

topologyKey is the key of a node label. kubernetes.io/hostname is a default label set on every node. Put the pod's labels inside matchLabels. Create the resources and kube-scheduler should schedule a single pod with the matching labels per node.
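
A minimal sketch of how this could sit in a full manifest, assuming a Deployment named foo-bar with 3 replicas (name, replica count and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-bar
spec:
  replicas: 3
  selector:
    matchLabels:
      id: foo-bar
  template:
    metadata:
      labels:
        id: foo-bar                # same label the spread constraint selects on
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                 # pod counts per node may differ by at most 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            id: foo-bar
      containers:
      - name: log-munger
        image: "logScraper:latest"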

To learn more, check out the documentation here and also this excellent blog post.

Mirza Prangon