I have a GKE private cluster in Autopilot mode running GKE 1.23, described below. I am trying to install an application from a vendor's Helm chart. Following their instructions, I use a script like this:
#! /bin/bash
helm repo add safesoftware https://safesoftware.github.io/helm-charts/
helm repo update
tag="2021.2"
version="safesoftware/fmeserver-$tag"
helm upgrade --install \
  fmeserver \
  $version \
  --set fmeserver.image.tag=$tag \
  --set deployment.hostname="REDACTED" \
  --set deployment.useHostnameIngress=true \
  --set deployment.tlsSecretName="my-ssl-cert" \
  --namespace ingress-nginx --create-namespace \
  #--set resources.core.requests.cpu="500m" \
  #--set resources.queue.requests.cpu="500m" \
However, I get errors from the GKE Warden!
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "safesoftware" chart repository
Update Complete. ⎈Happy Helming!⎈
W1201 10:25:08.117532 29886 warnings.go:70] Autopilot increased resource requests for Deployment ingress-nginx/engine-standard-group to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.201656 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/fmeserver-postgresql to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.304755 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/core to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.392965 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/queue to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.480421 29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/websocket to meet requirements. See http://g.co/gke/autopilot-resources.
Error: UPGRADE FAILED: cannot patch "core" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'core' cpu requests '{{400 -3} {\u003cnil\u003e} DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {\u003cnil\u003e} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]} && cannot patch "queue" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'queue' cpu requests '{{250 -3} {\u003cnil\u003e} DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {\u003cnil\u003e} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]}
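To see what CPU requests the chart actually renders (before Autopilot mutates or rejects them), I can render it locally with helm template; this is just a diagnostic sketch using the same flags as my script:
helm template fmeserver safesoftware/fmeserver-2021.2 \
  --set fmeserver.image.tag="2021.2" \
  --set deployment.hostname="REDACTED" \
  --set deployment.useHostnameIngress=true \
  --set deployment.tlsSecretName="my-ssl-cert" \
  --namespace ingress-nginx \
  | grep -B 2 -A 6 'resources:'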
So I modified the CPU requests in the resource spec for the pods causing the issues; one way is to uncomment the last two lines of the script (an equivalent values-file form is sketched after them):
--set resources.core.requests.cpu="500m" \
--set resources.queue.requests.cpu="500m" \
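The same overrides can also be put in a values file instead of --set flags (autopilot-values.yaml is just a name I picked; the nesting mirrors the --set paths above):
cat > autopilot-values.yaml <<'EOF'
resources:
  core:
    requests:
      cpu: "500m"
  queue:
    requests:
      cpu: "500m"
EOF
and then add -f autopilot-values.yaml to the helm upgrade command above.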
With those CPU requests set, the chart installs or upgrades successfully, but then the pods are PodUnschedulable with reason Cannot schedule pods: Insufficient cpu. Depending on the exact changes to the chart, I sometimes also see Cannot schedule pods: node(s) had volume node affinity conflict.
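To dig into the volume node affinity conflict, I compare the zone each PersistentVolume is pinned to with the zones that currently have nodes; a rough sketch of the read-only commands I use:
kubectl get pvc -n ingress-nginx
kubectl get pv -o custom-columns=NAME:.metadata.name,CLAIM:.spec.claimRef.name,AFFINITY:.spec.nodeAffinity.required.nodeSelectorTerms
kubectl get nodes -L topology.kubernetes.io/zone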
I can't see how to increase either the number of pods or the size of each (e2-medium) node in Autopilot mode, nor can I find a way to remove those guards. I have checked the quotas and can't see any quota issue. I can install other workloads, including ingress-nginx.
I am not sure what the issue is, and I am not an expert with Helm or Kubernetes.
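This is roughly how I have been inspecting what machine types and allocatable CPU Autopilot actually provisioned (read-only commands, nothing here changes the cluster):
kubectl get nodes -L node.kubernetes.io/instance-type -L topology.kubernetes.io/zone   # machine type and zone per node
kubectl describe nodes | grep -A 8 'Allocatable:'                                      # allocatable CPU/memory per node
kubectl top nodes                                                                      # current usage (GKE provides metrics-server)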
For reference, the cluster can be described as:
addonsConfig:
cloudRunConfig:
disabled: true
loadBalancerType: LOAD_BALANCER_TYPE_EXTERNAL
configConnectorConfig: {}
dnsCacheConfig:
enabled: true
gcePersistentDiskCsiDriverConfig:
enabled: true
gcpFilestoreCsiDriverConfig:
enabled: true
gkeBackupAgentConfig: {}
horizontalPodAutoscaling: {}
httpLoadBalancing: {}
kubernetesDashboard:
disabled: true
networkPolicyConfig:
disabled: true
autopilot:
enabled: true
autoscaling:
autoprovisioningNodePoolDefaults:
imageType: COS_CONTAINERD
management:
autoRepair: true
autoUpgrade: true
oauthScopes:
- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/service.management.readonly
- https://www.googleapis.com/auth/servicecontrol
- https://www.googleapis.com/auth/trace.append
serviceAccount: default
upgradeSettings:
maxSurge: 1
strategy: SURGE
autoscalingProfile: OPTIMIZE_UTILIZATION
enableNodeAutoprovisioning: true
resourceLimits:
- maximum: '1000000000'
resourceType: cpu
- maximum: '1000000000'
resourceType: memory
- maximum: '1000000000'
resourceType: nvidia-tesla-t4
- maximum: '1000000000'
resourceType: nvidia-tesla-a100
binaryAuthorization: {}
clusterIpv4Cidr: 10.102.0.0/21
createTime: '2022-11-30T04:47:19+00:00'
currentMasterVersion: 1.23.12-gke.100
currentNodeCount: 7
currentNodeVersion: 1.23.12-gke.100
databaseEncryption:
state: DECRYPTED
defaultMaxPodsConstraint:
maxPodsPerNode: '110'
endpoint: REDACTED
id: REDACTED
initialClusterVersion: 1.23.12-gke.100
initialNodeCount: 1
instanceGroupUrls: REDACTED
ipAllocationPolicy:
clusterIpv4Cidr: 10.102.0.0/21
clusterIpv4CidrBlock: 10.102.0.0/21
clusterSecondaryRangeName: pods
servicesIpv4Cidr: 10.103.0.0/24
servicesIpv4CidrBlock: 10.103.0.0/24
servicesSecondaryRangeName: services
stackType: IPV4
useIpAliases: true
labelFingerprint: '05525394'
legacyAbac: {}
location: europe-west3
locations:
- europe-west3-c
- europe-west3-a
- europe-west3-b
loggingConfig:
componentConfig:
enableComponents:
- SYSTEM_COMPONENTS
- WORKLOADS
loggingService: logging.googleapis.com/kubernetes
maintenancePolicy:
resourceVersion: 93731cbd
window:
dailyMaintenanceWindow:
duration: PT4H0M0S
startTime: 03:00
masterAuth:
masterAuthorizedNetworksConfig:
cidrBlocks:
enabled: true
monitoringConfig:
componentConfig:
enableComponents:
- SYSTEM_COMPONENTS
monitoringService: monitoring.googleapis.com/kubernetes
name: gis-cluster-uat
network: geo-nw-uat
networkConfig:
nodeConfig:
diskSizeGb: 100
diskType: pd-standard
imageType: COS_CONTAINERD
machineType: e2-medium
metadata:
disable-legacy-endpoints: 'true'
oauthScopes:
- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/service.management.readonly
- https://www.googleapis.com/auth/servicecontrol
- https://www.googleapis.com/auth/trace.append
serviceAccount: default
shieldedInstanceConfig:
enableIntegrityMonitoring: true
enableSecureBoot: true
workloadMetadataConfig:
mode: GKE_METADATA
nodePoolAutoConfig: {}
nodePoolDefaults:
nodeConfigDefaults:
loggingConfig:
variantConfig:
variant: DEFAULT
nodePools:
- autoscaling:
autoprovisioned: true
enabled: true
maxNodeCount: 1000
config:
diskSizeGb: 100
diskType: pd-standard
imageType: COS_CONTAINERD
machineType: e2-medium
metadata:
disable-legacy-endpoints: 'true'
oauthScopes:
- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/service.management.readonly
- https://www.googleapis.com/auth/servicecontrol
- https://www.googleapis.com/auth/trace.append
serviceAccount: default
shieldedInstanceConfig:
enableIntegrityMonitoring: true
enableSecureBoot: true
workloadMetadataConfig:
mode: GKE_METADATA
initialNodeCount: 1
instanceGroupUrls:
locations:
management:
autoRepair: true
autoUpgrade: true
maxPodsConstraint:
maxPodsPerNode: '32'
name: default-pool
networkConfig:
podIpv4CidrBlock: 10.102.0.0/21
podRange: pods
podIpv4CidrSize: 26
selfLink: REDACTED
status: RUNNING
upgradeSettings:
maxSurge: 1
strategy: SURGE
version: 1.23.12-gke.100
- autoscaling:
autoprovisioned: true
enabled: true
maxNodeCount: 1000
config:
diskSizeGb: 100
diskType: pd-standard
imageType: COS_CONTAINERD
machineType: e2-standard-2
metadata:
disable-legacy-endpoints: 'true'
oauthScopes:
- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
- https://www.googleapis.com/auth/service.management.readonly
- https://www.googleapis.com/auth/servicecontrol
- https://www.googleapis.com/auth/trace.append
reservationAffinity:
consumeReservationType: NO_RESERVATION
serviceAccount: default
shieldedInstanceConfig:
enableIntegrityMonitoring: true
enableSecureBoot: true
workloadMetadataConfig:
mode: GKE_METADATA
instanceGroupUrls:
locations:
management:
autoRepair: true
autoUpgrade: true
maxPodsConstraint:
maxPodsPerNode: '32'
name: nap-1rrw9gqf
networkConfig:
podIpv4CidrBlock: 10.102.0.0/21
podRange: pods
podIpv4CidrSize: 26
selfLink: REDACTED
status: RUNNING
upgradeSettings:
maxSurge: 1
strategy: SURGE
version: 1.23.12-gke.100
notificationConfig:
pubsub: {}
privateClusterConfig:
enablePrivateNodes: true
masterGlobalAccessConfig:
enabled: true
masterIpv4CidrBlock: 192.168.0.0/28
peeringName: gke-nf69df7b6242412e9932-582a-f600-peer
privateEndpoint: 192.168.0.2
publicEndpoint: REDACTED
releaseChannel:
channel: REGULAR
resourceLabels:
environment: uat
selfLink: REDACTED
servicesIpv4Cidr: 10.103.0.0/24
shieldedNodes:
enabled: true
status: RUNNING
subnetwork: redacted
verticalPodAutoscaling:
enabled: true
workloadIdentityConfig:
workloadPool: REDACTED
zone: europe-west3
EDIT: Adding pod describe logs.
kubectl describe pod core -n ingress-nginx
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 6m49s (x86815 over 3d22h) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Warning BackOff 110s (x13994 over 3d23h) kubelet Back-off restarting failed container
kubectl describe pod queue -n ingress-nginx
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 9m29s (x18130 over 2d14h) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match pod affinity rules, 3 node(s) had volume node affinity conflict
Normal NotTriggerScaleUp 4m28s (x24992 over 2d14h) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had volume node affinity conflict, 2 node(s) didn't match pod affinity rules
Warning FailedScheduling 3m33s (x3385 over 2d14h) gke.io/optimize-utilization-scheduler 0/7 nodes are available: 1 node(s) had volume node affinity conflict, 6 Insufficient cpu.
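If it helps, these are the kinds of commands I run to dig further into the failing readiness probe (core-0 is the first pod of the core StatefulSet in my cluster; adjust the name, and add -c <container> if the pod has more than one container):
kubectl logs core-0 -n ingress-nginx --previous                       # logs from the last crashed/restarted container
kubectl get events -n ingress-nginx --sort-by=.lastTimestamp | tail -n 20
kubectl get statefulset core -n ingress-nginx -o jsonpath='{.spec.template.spec.containers[*].resources}'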