
I am trying to experiment with a 2-node cluster (I will scale up later once I stabilize it) for MongoDB, using EKS. The two nodes are running in two different AWS availability zones. The descriptor is as follows:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: mongod
  labels:
    name: mongo-repl
spec:
  serviceName: mongodb-service
  replicas: 2
  selector:
    matchLabels:
      app: mongod
      role: mongo
      environment: test
  template:
    metadata:
      labels:
        app: mongod
        role: mongo
        environment: test
    spec:
      terminationGracePeriodSeconds: 15
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - ap-south-1a
                - ap-south-1b
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mongod
              - key: role
                operator: In
                values:
                - mongo
              - key: environment
                operator: In
                values:
                - test
            topologyKey: kubernetes.io/hostname
      containers:
        .....

The objective here is to NOT schedule another pod on the same node where a pod with the labels app=mongod,role=mongo,environment=test is already running.

When I deploy the spec, only one mongo pod gets created, on one node.

ubuntu@ip-192-170-0-18:~$ kubectl describe statefulset mongod
Name:               mongod
Namespace:          default
CreationTimestamp:  Sun, 16 Feb 2020 16:44:16 +0000
Selector:           app=mongod,environment=test,role=mongo
Labels:             name=mongo-repl
Annotations:        <none>
Replicas:           2 desired | 2 total
Update Strategy:    OnDelete
Pods Status:        1 Running / 1 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=mongod
           environment=test
           role=mongo
  Containers:

kubectl describe pod mongod-1

Node:           <none>
Labels:         app=mongod
                controller-revision-hash=mongod-66f7c87bbb
                environment=test
                role=mongo
                statefulset.kubernetes.io/pod-name=mongod-1
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
....
....
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  42s (x14 over 20m)  default-scheduler  0/2 nodes are available: 1 Insufficient pods, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.


Unable to figure out what is conflicting in the affinity specs. I'd really appreciate some insight here!
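
For reference, the "Insufficient pods" part of the message can be cross-checked against each node's pod capacity (EKS caps the pod count per node based on the instance type's ENI/IP limits). A rough sketch of that check, assuming plain kubectl access (<node-name> below is a placeholder):

# allocatable pod count per node (EKS derives this from the instance type)
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods

# pods (including kube-system ones) already scheduled on a given node
kubectl describe node <node-name> | grep -A 10 "Non-terminated Pods"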


Edit on Feb/21: Added information on the new error below

Based on the suggestions, I have now scaled up the worker nodes and started receiving a clearer error message --

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  51s (x554 over 13h)  default-scheduler  0/2 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict.

So the main issue now (after scaling up worker nodes) turns out to be --

1 node(s) had volume node affinity conflict
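
For reference, a quick way to compare the zone each PV is pinned to against the zone each node sits in (a sketch; the zone label key assumed here is failure-domain.beta.kubernetes.io/zone, which is what this 1.14 cluster exposes):

# zone label on each node
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone

# zone each pre-created PV is pinned to via its nodeAffinity, plus the claim it is bound to
kubectl get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,ZONE:".spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]"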

Posting all of my configuration artifacts again below:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: mongod
  labels:
    name: mongo-repl
spec:
  serviceName: mongodb-service
  replicas: 2
  selector:
    matchLabels:
      app: mongod
      role: mongo
      environment: test
  template:
    metadata:
      labels:
        app: mongod
        role: mongo
        environment: test
    spec:
      terminationGracePeriodSeconds: 15
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - ap-south-1a
                - ap-south-1b
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mongod
              - key: role
                operator: In
                values:
                - mongo
              - key: environment
                operator: In
                values:
                - test
            topologyKey: kubernetes.io/hostname
      containers:
        - name: mongod-container
          .......
      volumes:
        - name: mongo-vol
          persistentVolumeClaim:
            claimName: mongo-pvc

PVC --

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongo-pvc
spec:
  storageClassName: gp2-multi-az
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi

PV --

apiVersion: "v1"
kind: "PersistentVolume"
metadata:
  name: db-volume-0
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: gp2-multi-az
  awsElasticBlockStore:
    volumeID: vol-06f12b1d6c5c93903
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
        #- key: topology.kubernetes.io/zone
          operator: In
          values:
          - ap-south-1a

apiVersion: "v1"
kind: "PersistentVolume"
metadata:
  name: db-volume-1
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: gp2-multi-az
  awsElasticBlockStore:
    volumeID: vol-090ab264d4747f131
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
        #- key: topology.kubernetes.io/zone
          operator: In
          values:
          - ap-south-1b

Storage Class --

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-multi-az
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: gp2
  fsType: ext4
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - ap-south-1a
    - ap-south-1b

I don't want to opt for dynamic PVC.
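
As a sanity check on the two PVs above, the zone of each underlying EBS volume can be confirmed directly; each volume's AvailabilityZone must match the zone in the corresponding PV's nodeAffinity (a sketch, assuming the AWS CLI is configured for ap-south-1):

aws ec2 describe-volumes \
  --volume-ids vol-06f12b1d6c5c93903 vol-090ab264d4747f131 \
  --query "Volumes[].[VolumeId,AvailabilityZone]" --output table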

As per @rabello's suggestion, adding the outputs below --

kubectl get pods --show-labels
NAME       READY   STATUS    RESTARTS   AGE   LABELS
mongod-0   1/1     Running   0          14h   app=mongod,controller-revision-hash=mongod-5b4699fd85,environment=test,role=mongo,statefulset.kubernetes.io/pod-name=mongod-0
mongod-1   0/1     Pending   0          14h   app=mongod,controller-revision-hash=mongod-5b4699fd85,environment=test,role=mongo,statefulset.kubernetes.io/pod-name=mongod-1

kubectl get nodes --show-labels
NAME                                           STATUS   ROLES    AGE   VERSION              LABELS
ip-192-170-0-8.ap-south-1.compute.internal     Ready    <none>   14h   v1.14.7-eks-1861c5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/nodegroup-image=ami-07fd6cdebfd02ef6e,eks.amazonaws.com/nodegroup=trl_compact_prod_db_node_group,failure-domain.beta.kubernetes.io/region=ap-south-1,failure-domain.beta.kubernetes.io/zone=ap-south-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-170-0-8.ap-south-1.compute.internal,kubernetes.io/os=linux
ip-192-170-80-14.ap-south-1.compute.internal   Ready    <none>   14h   v1.14.7-eks-1861c5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/nodegroup-image=ami-07fd6cdebfd02ef6e,eks.amazonaws.com/nodegroup=trl_compact_prod_db_node_group,failure-domain.beta.kubernetes.io/region=ap-south-1,failure-domain.beta.kubernetes.io/zone=ap-south-1b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-170-80-14.ap-south-1.compute.internal,kubernetes.io/os=linux
Rajesh
  • That error message looks pretty confusing, but I'm guessing what it's saying is that one of the nodes has hit the maximum number of pods that are allowed to be scheduled on it, and the other node already has another pod from this StatefulSet running, so this pod can't be scheduled there due to your anti-affinity rules, as desired. I would start by confirming what the max number of pods is that are allowed to be scheduled on the node that is not running one of the StatefulSet pods, and how many pods are in fact running there. See here: https://stackoverflow.com/a/56969305/1061413 – Amit Kumar Gupta Feb 16 '20 at 20:17
  • Max pods? Where did you get that from? – suren Feb 17 '20 at 02:04
  • @Rajesh do you have 2 worker nodes, or 1 master, 1 worker? – suren Feb 17 '20 at 02:06
  • Can you run the command to show the nodes and labels. kubectl get nodes --show-labels – Subramanian Manickam Feb 17 '20 at 03:12
  • Many thanks for the suggestions. Responding to the comments 1 at a time ... @AmitKumarGupta -- "I would start by confirming what the max number of pods" -- the other worker node is not running any pod defined by me. So the entire capacity is available. I'll check the SOF post which you shared. – Rajesh Feb 17 '20 at 05:12
  • 1
    @suren -- "do you have 2 worker nodes, or 1 master, 1 worker?" -- 2 worker nodes – Rajesh Feb 17 '20 at 05:12
  • @SubramanianManickam -- here is the output for 2 worker nodes ( you may need to format it for readability ) ubuntu@ip-192-170-0-18:~$ kubectl get nodes --show-labels – Rajesh Feb 17 '20 at 05:12
  • ip-192-170-0-5.ap-south-1.compute.internal Ready 11d v1.14.7-eks-1861c5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/nodegroup-image=ami-07fd6cdebfd02ef6e,eks.amazonaws.com/nodegroup=trl_compact_prod_db_node_group,failure-domain.beta.kubernetes.io/region=ap-south-1,failure-domain.beta.kubernetes.io/zone=ap-south-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-170-0-5.ap-south-1.compute.internal,kubernetes.io/os=linux – Rajesh Feb 17 '20 at 05:13
  • ip-192-170-80-19.ap-south-1.compute.internal Ready 11d v1.14.7-eks-1861c5 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.micro,beta.kubernetes.io/os=linux,eks.amazonaws.com/nodegroup-image=ami-07fd6cdebfd02ef6e,eks.amazonaws.com/nodegroup=trl_compact_prod_db_node_group,failure-domain.beta.kubernetes.io/region=ap-south-1,failure-domain.beta.kubernetes.io/zone=ap-south-1b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-170-80-19.ap-south-1.compute.internal,kubernetes.io/os=linux – Rajesh Feb 17 '20 at 05:14
  • @rajesh if you check the comment to the linked SO answer, they say that they found the max was 4 and 4 pods from the system namespace were running there, so that was the problem: https://stackoverflow.com/questions/52898067/aks-reporting-insufficient-pods/56969305#comment102065070_56969305 – Amit Kumar Gupta Feb 17 '20 at 06:04
  • You may wish to quickly check whether or not you’re having a similar issue – Amit Kumar Gupta Feb 17 '20 at 06:04
  • I've tested your configuration here and works fine! I'm using GKE instead AWS. But you can make a test and try to change your instance type for t2.small and see if it's works. – Mr.KoopaKiller Feb 17 '20 at 13:02
  • Thanks rabello & Amit . Looks like it's indeed the number of pod constraint kicking in. I just checked with what Amit suggested and the outcome is : allocatable: attachable-volumes-aws-ebs: "39" cpu: "1" ephemeral-storage: "19316009748" hugepages-2Mi: "0" memory: 904892Ki pods: "4" capacity: attachable-volumes-aws-ebs: "39" cpu: "1" ephemeral-storage: 20959212Ki hugepages-2Mi: "0" memory: 1007292Ki pods: "4" I'll have to rebuild the cluster with a larger shape and post back results. – Rajesh Feb 17 '20 at 13:35
  • @Rajesh, please post if worked after resize your cluster. – Mr.KoopaKiller Feb 18 '20 at 10:51
  • @AmitKumarGupta Unfortunately , it didn't work with a larger shape as well. allocatable: attachable-volumes-aws-ebs: "25" cpu: "2" ephemeral-storage: "19316009748" hugepages-1Gi: "0" hugepages-2Mi: "0" memory: 1899960Ki pods: "11" capacity: attachable-volumes-aws-ebs: "25" cpu: "2" ephemeral-storage: 20959212Ki hugepages-1Gi: "0" hugepages-2Mi: "0" memory: 2002360Ki pods: "11" ---- Still hitting the same issue !! – Rajesh Feb 20 '20 at 17:56
  • @rabello didn't work with a larger shape – Rajesh Feb 20 '20 at 17:59
  • Hey Rajesh, is the error message when you describe the pod still the same, ie does it still say 1 node has Insufficient pods? – Amit Kumar Gupta Feb 20 '20 at 20:28
  • @Rajesh please edit your question and include the output of the commands: "kubectl get pods --show-labels" and "kubectl get nodes --show-labels" – Mr.KoopaKiller Feb 21 '20 at 07:29
  • @AmitKumarGupta initially I thought it's the same error , however at a closer inspection , now it is a bit more clear -- 0/2 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict. I am going to edit my question and post all the other related artifacts – Rajesh Feb 21 '20 at 07:49
  • @rabello edited my question , please see "Edit on Feb/21" .. towards the bottom of this edit, I have posted the label info you asked for. – Rajesh Feb 21 '20 at 08:12
  • Ok, it's clear now! You are trying to mount the same EBS volume on both nodes in different availability zones. Actually, it's not possible. Recently AWS [released](https://aws.amazon.com/about-aws/whats-new/2020/02/ebs-multi-attach-available-provisioned-iops-ssd-volumes/) a new feature that allows multi-attach EBS, but only for instances in the same AZ. So you need to use another solution like EFS to achieve this. – Mr.KoopaKiller Feb 21 '20 at 10:28
  • @rabello , well actually I am using 2 different PVs for 2 different regions backed by 2 EBS volumes in respective regions. The whole idea of using replica-set spread across different availability zones is for availability/ fault tolerance for my statefulset. If this is not possible , then doesn't it defeat the whole purpose of availability zones ?? – Rajesh Feb 21 '20 at 14:02
  • Do you want the exact same volume mounted to all pods, or similar but separate volumes with each pod writing to its own individual PV? – Amit Kumar Gupta Feb 21 '20 at 14:23
  • Why don’t you want dynamic storage? The only way I can think to use a stateful set with pods spread across multiple nodes is to use PVC templates instead of a specific PVC. At deploy time k8s will dynamically create a PVC, and corresponding PV, for each pod. – Amit Kumar Gupta Feb 21 '20 at 14:25
  • @AmitKumarGupta , "Do you want the exact same volume ... " : to me it's 1 logical k8s volume backed by 2 EBS volumes in 2 AZs . "Why don’t you want dynamic storage?" : as far as I've understood, in case I have to bring down my cluster and start again, it may not be possible to start with the existing mongo db files if I use dynamic pvc / pvc template – Rajesh Feb 21 '20 at 15:16
  • @Rajesh , "1 logical k8s volume backed by 2 EBS volumes in 2 AZs" : so do you mean you would expect the exact same data to be on both volumes at every point in time? – Amit Kumar Gupta Feb 21 '20 at 15:26
  • @AmitKumarGupta Well I guess for mongo db the writes happen in primary and the secondaries replicate -- https://docs.mongodb.com/manual/core/replica-set-members/ . So yes both volumes will be in sync . – Rajesh Feb 21 '20 at 17:11
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/208272/discussion-between-amit-kumar-gupta-and-rajesh). – Amit Kumar Gupta Feb 21 '20 at 17:20
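
Following up on the PVC-template suggestion from the comments above, a minimal sketch of replacing the single shared mongo-pvc with volumeClaimTemplates, so that each pod gets its own PV provisioned in its own zone by the gp2-multi-az StorageClass (this is the dynamic-provisioning route the question prefers to avoid, so it is only illustrative):

# inside the StatefulSet spec, replacing the shared "volumes: mongo-vol -> mongo-pvc" entry in the pod template
  volumeClaimTemplates:
  - metadata:
      name: mongo-vol                    # container volumeMounts keep referring to "mongo-vol"
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: gp2-multi-az     # WaitForFirstConsumer provisions each EBS volume in the pod's own zone
      resources:
        requests:
          storage: 8Gi
# the resulting claims are named mongo-vol-mongod-0 and mongo-vol-mongod-1, one per pod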

1 Answer


EBS volumes are zonal. They can only be accessed by pods that are located in the same AZ as the volume. Your StatefulSet allows pods to be scheduled in multiple zones (ap-south-1a and ap-south-1b), so, given your other constraints, the scheduler may be attempting to schedule a pod on a node in a different AZ than its volume. I would try confining your StatefulSet to a single AZ, or using an operator to install Mongo.
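
For illustration, a minimal sketch of the single-AZ option, reusing the question's own nodeAffinity block with only ap-south-1a kept (both pre-created PVs would then also have to sit in ap-south-1a, and cross-AZ fault tolerance is given up):

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - ap-south-1a   # single zone only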

Jeremy Cowan