12

I have a 5-node cluster (1 master / 4 workers). Is it possible to configure a StatefulSet so that I can make a pod (or pods) run on a given node, knowing it has sufficient capacity, rather than the Kubernetes Scheduler making this decision?

Let's say my StatefulSet creates 4 pods (replicas: 4) named myapp-0, myapp-1, myapp-2 and myapp-3. Now what I am looking for is:

myapp-0 pod -- gets scheduled on --> worker-1

myapp-1 pod -- gets scheduled on --> worker-2

myapp-2 pod -- gets scheduled on --> worker-3

myapp-3 pod -- gets scheduled on --> worker-4

Please let me know if this can be achieved somehow. If I add a toleration to the pods of a StatefulSet, it will be the same for all of the pods, and all of them will get scheduled onto a single node matching the taint.

Thanks, J

Jaraws
  • The question was asked in 2015, but today I am in the same situation. Have you found an approach for this? If you remember, can you please let me know? – Nish Oct 11 '20 at 08:34

5 Answers

8

You can delegate responsibility for scheduling arbitrary subsets of pods to your own custom scheduler(s) that run(s) alongside, or instead of, the default Kubernetes scheduler.

You can write your own custom scheduler. A custom scheduler can be written in any language and can be as simple or complex as you need. Below is a very simple example of a custom scheduler written in Bash that assigns a node randomly. Note that you need to run this along with kubectl proxy for it to work.

SERVER='localhost:8001'

while true;
do
    # Find pods that requested this scheduler and are not yet bound to a node
    for PODNAME in $(kubectl --server $SERVER get pods -o json | jq '.items[] | select(.spec.schedulerName == "my-scheduler") | select(.spec.nodeName == null) | .metadata.name' | tr -d '"');
    do
        # Pick a random node from the cluster
        NODES=($(kubectl --server $SERVER get nodes -o json | jq '.items[].metadata.name' | tr -d '"'))
        NUMNODES=${#NODES[@]}
        CHOSEN=${NODES[$[$RANDOM % $NUMNODES]]}
        # Bind the pod to the chosen node through the API server
        curl --header "Content-Type:application/json" --request POST --data '{"apiVersion":"v1", "kind": "Binding", "metadata": {"name": "'$PODNAME'"}, "target": {"apiVersion": "v1", "kind": "Node", "name": "'$CHOSEN'"}}' http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/
        echo "Assigned $PODNAME to $CHOSEN"
    done
    sleep 1
done

Then, in your StatefulSet configuration file, you will have to add the line schedulerName: your-scheduler under the pod template's spec section.
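For reference, a minimal sketch of where that line goes in a StatefulSet manifest (the scheduler name my-scheduler and the image are placeholders matching the example script above, not required values):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myapp
spec:
  serviceName: myapp
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Pods carrying this schedulerName are skipped by the default scheduler
      # and bound by the custom scheduler instead.
      schedulerName: my-scheduler
      containers:
      - name: myapp
        image: nginx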

You can also use pod affinity and pod anti-affinity.

Example:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine

The below YAML snippet of the web-server StatefulSet has podAntiAffinity and podAffinity configured. This informs the scheduler that all of its replicas are to be co-located with pods that have the selector label app=store. It also ensures that no two web-server replicas are co-located on the same node.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.12-alpine

If we create the above two StatefulSets, our three-node cluster should look like below.

node-1          node-2          node-3
webserver-1     webserver-2     webserver-3
cache-1         cache-2         cache-3

The above example uses a podAntiAffinity rule with topologyKey: "kubernetes.io/hostname" to deploy the redis cluster so that no two instances are located on the same host.

You can also pin each pod to a particular node directly in its configuration file, e.g. with nodeName, which is the simplest form of node selection constraint but, due to its limitations, is typically not used. nodeName is a field of PodSpec. If it is non-empty, the scheduler ignores the pod and the kubelet running on the named node tries to run the pod. Thus, if nodeName is provided in the PodSpec, it takes precedence over the above methods for node selection.

Here is an example of a pod config file using the nodeName field:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-worker-1

More information about custom schedulers: custom-scheduler.

Take a look at this article: assigning-pods-kubernetes.

Malgorzata
5

You can use the following KubeMod ModRule:

apiVersion: api.kubemod.io/v1beta1
kind: ModRule
metadata:
  name: statefulset-pod-node-affinity
spec:
  type: Patch

  match:
    # Select pods named myapp-xxx.
    - select: '$.kind'
      matchValue: Pod
    - select: '$.metadata.name'
      matchRegex: myapp-.*

  patch:
    # Patch the selected pods such that their node affinity matches nodes that contain a label with the name of the pod.
    - op: add
      path: /spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution
      value: |-
        nodeSelectorTerms:
          - matchExpressions:
            - key: accept-pod/{{ .Target.metadata.name }}
              operator: In
              values:
                - 'true'

The above ModRule will monitor for the creation of pods named myapp-* and will inject a nodeAffinity section into their resource manifest before they get deployed. This will instruct the scheduler to schedule the pod to a node which has a label accept-pod/<pod-name> set to true.

Then you can assign future pods to nodes by adding labels to the nodes:

kubectl label node worker-1 accept-pod/myapp-0=true
kubectl label node worker-2 accept-pod/myapp-1=true
kubectl label node worker-3 accept-pod/myapp-2=true
...

After the above ModRule is deployed, creating the StatefulSet will trigger the creation of its pods, which will be intercepted by the ModRule. The ModRule will dynamically inject the nodeAffinity section using the name of the pod.

If, later on, the StatefulSet is deleted, deploying it again will lead to the pods being scheduled on the same exact nodes as they were before.

vassilvk
1

You can do this using nodeSelector and node affinity (take a look at this guide: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/); either can be used to run pods on specific nodes. But if a node has taints (restrictions), then you need to add tolerations for those nodes (more can be found here: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/). Using this approach you can specify a list of nodes to be used for your pods' scheduling; the catch is that if you specify, for example, 3 nodes and you have 5 pods, then you have no control over how many pods will run on each of those nodes. They get distributed as per the kube-scheduler. Another relevant use case: if you want to run one pod on each of the specified nodes, you can create a DaemonSet and select nodes using nodeSelector, as in the sketch below.
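For illustration, a minimal sketch of the nodeSelector approach (the label disktype=ssd, the node name worker-2 and the nginx image are assumptions for this example, not required values):

# Label the node you want the pod to land on
kubectl label node worker-2 disktype=ssd

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  # Only nodes carrying the label above are eligible for this pod
  nodeSelector:
    disktype: ssd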

Anmol Agrawal
  • Thanks for your reply. What I am looking for is to fix a node for an individual pod of a StatefulSet. Now, if I add tolerations to my container configurations in a StatefulSet, it will be common for all pods of my StatefulSet and would schedule all pods on a node with a matching taint. I have updated my question with more details. Kindly check. – Jaraws Feb 16 '20 at 07:42
  • Tolerations are for nodes which have taints. In nodeSelector or pod affinity you provide a node label. If you add the same label to your worker nodes (worker-1 to worker-4), then all the pods will be distributed among them. You only need to add tolerations when any of these nodes has taints. – Anmol Agrawal Feb 16 '20 at 13:59
0

Take a look at this guideline: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/. However, what you are looking for is the nodeSelector directive, which should be placed in the pod spec.
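As a rough sketch of combining a taint with a matching toleration and a nodeSelector (the taint key/value dedicated=myapp, the node name worker-1 and the nginx image are assumptions for this example):

# Taint the node so that only pods tolerating the taint can be scheduled there
kubectl taint nodes worker-1 dedicated=myapp:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  # Tolerate the taint applied above ...
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "myapp"
    effect: "NoSchedule"
  # ... and pin the pod to that node via the built-in hostname label
  nodeSelector:
    kubernetes.io/hostname: worker-1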

cperez08
  • Thanks for your reply @cperez08. What I am looking for is to fix a node for an individual pod of a StatefulSet. Now, if I add tolerations to my container configurations in a StatefulSet, it will be common for all pods of my StatefulSet and would schedule all pods on a node with a matching taint. I have updated my question with more details. Kindly check. – Jaraws Feb 16 '20 at 07:43
  • @Jaraws, in that case, I think that's not possible; the only thing you could do is schedule different StatefulSets or Deployments on different nodes. – cperez08 Feb 16 '20 at 15:50
0

You can use podAntiAffinity to distribute replicas to different nodes.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"

This would deploy web-0 on worker-1, web-1 on worker-2, web-2 on worker-3 and web-3 on worker-4.

Arghya Sadhu