Almost two years later, we are experiencing the same issue as described in this SO post.
Our workloads had been running without any disruption since 2018, but they suddenly stopped because we had to renew certificates, and since then we have not been able to start them again. The failure is caused by the fact that pods try to mount a persistent disk via NFS, and the nfs-server pod (based on gcr.io/google_containers/volume-nfs:0.8) cannot mount the persistent disk.
We have upgraded from 1.23 to 1.25.5-gke.2000 (experimenting with a few intermediate versions along the way) and have therefore also switched to containerd.
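For context, the workloads reach the export through an NFS-backed PersistentVolume/PersistentVolumeClaim pair, roughly along these lines. This is only a minimal sketch with illustrative names (nfs-pv, nfs-pvc, a 10Gi size), not our exact manifests, and it assumes an nfs-server Service in the default namespace:

# Sketch of the workload side (illustrative names, not the real manifests).
# The PV points at the nfs-server Service (DNS name or ClusterIP);
# workload pods mount the PVC.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server.default.svc.cluster.local
    path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi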
We have recreated everything multiple times with slight variations, but no luck. The pods definitely cannot access any persistent disk.
We've checked the basic things: the persistent disks are in the same zone as the GKE cluster, the service account used by the pods has the necessary permissions to access the disks, etc.
No logs are visible for any of the pods, which is also strange since logging seems to be correctly configured.
Here is the nfs-server.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    role: nfs-server
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - image: gcr.io/google_containers/volume-nfs:0.8
        imagePullPolicy: IfNotPresent
        name: nfs-server
        ports:
        - containerPort: 2049
          name: nfs
          protocol: TCP
        - containerPort: 20048
          name: mountd
          protocol: TCP
        - containerPort: 111
          name: rpcbind
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /exports
          name: webapp-disk
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - gcePersistentDisk:
          fsType: ext4
          pdName: webapp-data-disk
        name: webapp-disk
status: {}
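The nfs-server pods are exposed to the rest of the cluster through a Service covering the three ports above. A minimal sketch of what that looks like, assuming the layout of the standard volume-nfs example (the Service name and namespace are assumptions on our side):

# Service fronting the nfs-server Deployment (sketch; ports match the container above).
apiVersion: v1
kind: Service
metadata:
  name: nfs-server
spec:
  selector:
    role: nfs-server
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111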