
Almost two years later, we are experiencing the same issue as described in this SO post.

Our workloads had been running without any disruption since 2018, and they suddenly stopped because we had to renew certificates. Since then we have not been able to start the workloads again. The failure is caused by the fact that the pods mount a persistent disk via NFS, and the nfs-server pod (based on gcr.io/google_containers/volume-nfs:0.8) cannot mount the persistent disk.

We have upgraded from 1.23 to 1.25.5-gke.2000 (experimenting with a few intermediate versions along the way) and in the process have also switched to containerd.

We have recreated everything multiple times with slight variations, but no luck. Pods definitely cannot access any persistent disk.

We've checked the basic things: the persistent disks are in the same zone as the GKE cluster, the service account used by the pods has the necessary permissions to access the disks, etc.
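
For example, the zone check boils down to comparing something like the following (the disk name comes from the manifest below; the cluster name and zones are placeholders):

# Zone of the persistent disk and any instances it is currently attached to
gcloud compute disks describe webapp-data-disk --zone <disk-zone> --format="value(zone,users)"

# Location of the GKE cluster
gcloud container clusters describe <cluster-name> --zone <cluster-zone> --format="value(location)"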

No logs are visible for any of the pods, which is also strange since logging seems to be correctly configured.
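
Since the containers log nothing, about the only place the mount failure shows up is in the pod events, e.g. (the pod name is illustrative):

kubectl describe pod nfs-server-<pod-id>   # look for FailedAttachVolume / FailedMount events
kubectl get events --sort-by=.metadata.creationTimestamp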

Here is the nfs-server.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    role: nfs-server
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - image: gcr.io/google_containers/volume-nfs:0.8
        imagePullPolicy: IfNotPresent
        name: nfs-server
        ports:
        - containerPort: 2049
          name: nfs
          protocol: TCP
        - containerPort: 20048
          name: mountd
          protocol: TCP
        - containerPort: 111
          name: rpcbind
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /exports
          name: webapp-disk
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - gcePersistentDisk:
          fsType: ext4
          pdName: webapp-data-disk
        name: webapp-disk
status: {}

1 Answer


OK, fixed. I had to enable the Compute Engine persistent disk CSI driver on our legacy cluster, as described here...
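
For anyone hitting the same thing: the CSI driver is enabled by default on newer GKE clusters, but older ("legacy") clusters may still have it turned off. It can be enabled with something like the following (cluster name and zone are placeholders):

gcloud container clusters update <cluster-name> \
    --zone <cluster-zone> \
    --update-addons=GcePersistentDiskCsiDriver=ENABLED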
