
I have the following PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-delete
  resources:
    requests:
      storage: 50Gi

This PVC is used by two workloads:

A statefulset

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: foo
  labels:
    component: foo
spec:
  serviceName: foo  # assumes a headless Service named foo exists
  selector:
    matchLabels:
      component: foo
  template:
    metadata:
      labels:
        component: foo
    spec:
      containers:
      - image: foo:1.0.0
        name: foo
        volumeMounts:
        - mountPath: /a/specific/path
          name: shared
          readOnly: true
      volumes:
      - name: shared
        persistentVolumeClaim:
          claimName: my-pvc

A deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bar
spec:
  replicas: 1
  selector:
    matchLabels:
      component: bar
  template:
    metadata:
      labels:
        component: bar
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                component: foo
            topologyKey: kubernetes.io/hostname
      containers:
      - image: bar:1.0.0
        name: bar
        volumeMounts:
        - mountPath: /a/specific/path
          name: shared
          readOnly: true
      volumes:
      - name: shared
        persistentVolumeClaim:
          claimName: my-pvc

If Pod A and Pod B are not on the same node, the volume cannot be mounted by one of the pods.

If Pod A and Pod B reference each other with affinity and (re)start at the same time, the scheduler cannot place either of them (circular dependency).

If Pod A and Pod B reference a specific node with affinity, what happens if that node is decommissioned when the cluster scales down?

How can I ensure my foo and bar workloads always start on the same node, given that they share a PVC?

Will
  • Why do they need to be on the same node? If the only reason is that the underlying volume storage is specific to a node, Kubernetes should be able to handle this case on its own. (But note that there are some significant scaling problems around this – if you need 20 replicas, will all of the containers fit on the same single node? – and you will need to protect against concurrent reads and writes from different containers; avoiding a shared filesystem might be a better approach if possible.) – David Maze Mar 14 '23 at 10:57
  • The volume is not specific to a node, but if pod `foo` starts on node A and pod `bar` on node B, then the volume cannot be mounted by both pods. Using podAffinity between `foo` and `bar` fixes that issue, but what if `foo` and `bar` start at the same time? It does not work because each would require the other to already be scheduled on a node. – Will Mar 14 '23 at 11:27

2 Answers


It's not crashing because of scheduling onto another node; only one of the pods can start because the PVC uses ReadWriteOnce:

accessModes:
  - ReadWriteOnce

If you want to share the PVC, you have to create it with the ReadWriteMany access mode so that multiple Pods can write to a single PVC and share it.

You have to use NFS, MinIO, or another storage backend to create the PVC with accessMode ReadWriteMany.

Read more about access modes: https://stackoverflow.com/a/57798369/5525824
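
For reference, a minimal sketch of the same claim with ReadWriteMany, assuming an RWX-capable storage class exists in the cluster (the class name azurefile is an assumption here, based on managed-premium suggesting AKS; an NFS-backed class would work the same way):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteMany              # allows mounting from pods on different nodes
  storageClassName: azurefile  # assumed RWX-capable class; adjust to your cluster
  resources:
    requests:
      storage: 50Gi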

Use node affinity to pin the pods to specific nodes:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - antarctica-east1
          - antarctica-west1

Harsh Manvar
  • No, when both workloads run on the same node it works as expected. The issue occurs if pod `foo` crashes (a failing liveness probe, for example) and is rescheduled onto a different node than `bar` – Will Mar 14 '23 at 09:41
  • What I am looking for is an affinity-ish way of scheduling my workloads on the same node – Will Mar 14 '23 at 09:42
  • Updated the answer: use node affinity to schedule the pods; you can edit the expression with a node name etc. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity – Harsh Manvar Mar 14 '23 at 10:06
  • I thought about this fix before, but I feel it is not an elegant solution. What if the node you are targeting with the selector is fully loaded with other pods? I would rather have an affinity to the node where the PV/PVC is currently attached than to a node name. Unfortunately I cannot find a way to declare this. – Will Mar 14 '23 at 11:31
  • That's the default behavior of k8s if you don't set node affinity. You can also set affinity on the PV as well as on the deployment, so that both end up on the same node (see the sketch after this comment thread). – Harsh Manvar Mar 14 '23 at 11:45
  • If Pod A and Pod B reference a specific node with affinity, what happens if the node is decommissioned when the cluster scales down? – Will Mar 14 '23 at 12:09
  • Both pods will be in Pending state; however, it depends on how you set your node affinity rules (required/preferred, etc.) and what conditions you set on the PV/deployment. You can set multiple rules in the affinity so that if a node goes down, everything gets scheduled onto another available zone or node label. – Harsh Manvar Mar 14 '23 at 12:17
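
A minimal, untested sketch of the PV-level affinity mentioned in the last comments: a pre-provisioned PersistentVolume pinned to one node via spec.nodeAffinity. The local path /mnt/data, the node name node-1, and the local-storage class are hypothetical; dynamically provisioned Azure Disk PVs get a zone-level nodeAffinity from the provisioner instead.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage   # hypothetical no-provisioner class
  local:
    path: /mnt/data                 # hypothetical path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1                  # hypothetical node name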

Haven't tried this, but from looking into the documentation, the only possible solution might be to set podAffinity. To have a backup in case of a failing node, multiple podAffinity terms can be added: the first as a hard requirement, the others as soft preferences with different weights.

This might require substantial testing to ensure it works as expected. More details below.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#an-example-of-a-pod-that-uses-pod-affinity
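
An untested sketch of what that could look like in the bar pod template, combining the hard co-location rule from the question with a softer zone-level preference (the weight of 100 and the zone topology key are illustrative choices, not taken from the question):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          component: foo
      topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                 # illustrative weight
      podAffinityTerm:
        labelSelector:
          matchLabels:
            component: foo
        topologyKey: topology.kubernetes.io/zone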

Amit