First of all: I have read other posts like this one.
My staging cluster runs on AWS using spot instances.
I have around 50+ pods (running different services/products) and 6 StatefulSets.
I created the StatefulSets this way:
Note: I do not create the PVs and PVCs manually; they are created from the StatefulSet's volumeClaimTemplates.
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  labels:
    app: redis
spec:
  selector:
    matchLabels:
      app: redis
  serviceName: "redis"
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:alpine
        imagePullPolicy: Always
        ports:
        - containerPort: 6379
          name: client
        volumeMounts:
        - name: data
          mountPath: /data
          readOnly: false
  volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        name: redis-gp2
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  ports:
  - port: 6379
    name: redis
    targetPort: 6379
  selector:
    app: redis
  type: NodePort
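Side note: the volumeClaimTemplates above do not set storageClassName, so the claims fall back to the cluster's default gp2 class (you can see that in the PVC describe further down). If I wanted to pin them to an explicit class, I believe the template would look roughly like this (the class name shown is just the default one):

  volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        name: redis-gp2
    spec:
      # pin the claim to an explicit StorageClass instead of the cluster default
      storageClassName: gp2
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi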
I do have node and pod autoscalers configured.
In the past week, after deploying some extra microservices during the usage peak, the node autoscaler triggered.
During the scale-down, some pods (from the StatefulSets) crashed with the error "node(s) had volume node affinity conflict".
My first reaction was to delete and "recreate" the PVs/PVCs with high priority. That "fixed" the pending pods at the time.
Today I forced another scale-up so I could check what was happening.
The problem occurs during the scale-up and takes a long time (around 30 minutes) to go back to normal, even after scaling back down.
Describe Pod:
Name: redis-0
Namespace: ***-staging
Priority: 1000
Priority Class Name: prioridade-muito-alta
Node: ip-***-***-***-***.sa-east-1.compute.internal/***.***.*.***
Start Time: Mon, 03 Jan 2022 09:24:13 -0300
Labels: app=redis
controller-revision-hash=redis-6fd5f59c5c
statefulset.kubernetes.io/pod-name=redis-0
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: ***.***.***.***
IPs:
IP: ***.***.***.***
Controlled By: StatefulSet/redis
Containers:
redis:
Container ID: docker://4928f38ed12c206dc5915c863415d3eba98b9592f2ab5c332a900aa2fa2cef64
Image: redis:alpine
Image ID: docker-pullable://redis@sha256:4bed291aa5efb9f0d77b76ff7d4ab71eee410962965d052552db1fb80576431d
Port: 6379/TCP
Host Port: 0/TCP
State: Running
Started: Mon, 03 Jan 2022 09:24:36 -0300
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/data from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-ngc7q (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-redis-0
ReadOnly: false
default-token-***:
Type: Secret (a volume populated by a Secret)
SecretName: *****
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 59m (x4 over 61m) default-scheduler 0/7 nodes are available: 1 Too many pods, 1 node(s) were unschedulable, 5 node(s) had volume node affinity conflict.
Warning FailedScheduling 58m default-scheduler 0/7 nodes are available: 1 Too many pods, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1641210902}, that the pod didn't tolerate, 1 node(s) were unschedulable, 4 node(s) had volume node affinity conflict.
Warning FailedScheduling 58m default-scheduler 0/7 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1641210902}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Too many pods, 3 node(s) had volume node affinity conflict.
Warning FailedScheduling 57m (x2 over 58m) default-scheduler 0/7 nodes are available: 2 Too many pods, 2 node(s) were unschedulable, 3 node(s) had volume node affinity conflict.
Warning FailedScheduling 50m (x9 over 57m) default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Too many pods, 3 node(s) had volume node affinity conflict.
Warning FailedScheduling 48m (x2 over 49m) default-scheduler 0/5 nodes are available: 2 Too many pods, 3 node(s) had volume node affinity conflict.
Warning FailedScheduling 35m (x10 over 48m) default-scheduler 0/5 nodes are available: 1 Too many pods, 4 node(s) had volume node affinity conflict.
Normal NotTriggerScaleUp 30m (x163 over 58m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had volume node affinity conflict
Warning FailedScheduling 30m (x3 over 33m) default-scheduler 0/5 nodes are available: 5 node(s) had volume node affinity conflict.
Normal SuccessfulAttachVolume 29m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-23168a78-2286-40b7-aa71-194ca58e0005"
Normal Pulling 28m kubelet, ip-***-***-***-***.sa-east-1.compute.internal Pulling image "redis:alpine"
Normal Pulled 28m kubelet, ip-***-***-***-***.sa-east-1.compute.internal Successfully pulled image "redis:alpine" in 3.843908086s
Normal Created 28m kubelet, ip-***-***-***-***.sa-east-1.compute.internal Created container redis
Normal Started 28m kubelet, ip-***-***-***-***.sa-east-1.compute.internal Started container redis
PVC:
Name: data-redis-0
Namespace: ***-staging
StorageClass: gp2
Status: Bound
Volume: pvc-23168a78-2286-40b7-aa71-194ca58e0005
Labels: app=redis
name=redis-gp2
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
volume.kubernetes.io/selected-node: ip-***-***-***-***.sa-east-1.compute.internal
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Mounted By: redis-0
Events: <none>
PV:
Name: pvc-23168a78-2286-40b7-aa71-194ca58e0005
Labels: failure-domain.beta.kubernetes.io/region=sa-east-1
failure-domain.beta.kubernetes.io/zone=sa-east-1b
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: gp2
Status: Bound
Claim: ***-staging/data-redis-0
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 1Gi
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/zone in [sa-east-1b]
failure-domain.beta.kubernetes.io/region in [sa-east-1]
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://sa-east-1b/vol-061fd23a65185d42c
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
This happened in 4 of my 6 StatefulSets.
Question:
If I create the PVs and PVCs manually, setting:
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - sa-east-1
will the scale-up/down stop messing with the StatefulSets?
If not, what can I do to avoid this problem?
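As far as I understand, volumeBindingMode and allowedTopologies are StorageClass fields rather than PV/PVC fields, so what I have in mind is a custom class roughly like the sketch below (the class name and the zone list are placeholders for my setup), referenced from the volumeClaimTemplates via storageClassName:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: redis-gp2-wffc             # placeholder name
provisioner: kubernetes.io/aws-ebs # same in-tree provisioner shown in the PV above
parameters:
  type: gp2
reclaimPolicy: Delete
# Delay provisioning/binding until a pod using the claim is scheduled,
# so the volume is created in a zone where that pod can actually run.
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:                        # placeholder zones for sa-east-1
    - sa-east-1a
    - sa-east-1b
    - sa-east-1c

The volumeClaimTemplates would then point at this class via storageClassName (as in the snippet earlier) instead of relying on the default gp2 class.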