I want to set up Azure Arc on a Google Cloud GKE Autopilot cluster so I can manage its Kubernetes resources from Azure. This is both my first GKE cluster and my first Azure Arc connection. I am following the quickstart here (https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster?tabs=azure-cli#prerequisites). I have an active GKE cluster, and there is an Azure CLI command that establishes the link AND deploys resources via Helm to my GKE cluster (which is set as the default kubectl context).
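For reference, this is roughly the connect command I'm running from that quickstart (the cluster and resource group names below are placeholders; eastus matches the region that shows up in the job args further down):

    # Install the connectedk8s CLI extension, then connect the current kubectl context to Azure Arc
    az extension add --name connectedk8s
    az connectedk8s connect --name my-gke-autopilot-cluster \
        --resource-group my-arc-rg \
        --location eastus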
The job sent to my GKE cluster always fails. Here is the describe output for the job on my cluster (I grabbed it while it was running):
Name:             cluster-diagnostic-checks-job
Namespace:        azure-arc-release
Selector:         controller-uid=1285d828-698e-4e7d-b03d-ac819e793024
Labels:           app=cluster-diagnostic-checks
                  app.kubernetes.io/managed-by=Helm
Annotations:      autopilot.gke.io/resource-adjustment:
                    {"input":{"containers":[{"name":"cluster-diagnostic-checks-container"}]},"output":{"containers":[{"limits":{"cpu":"500m","ephemeral-storag...
                  batch.kubernetes.io/job-tracking:
                  meta.helm.sh/release-name: cluster-diagnostic-checks
                  meta.helm.sh/release-namespace: azure-arc-release
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Tue, 16 May 2023 10:17:09 -0700
Pods Statuses:    1 Active (0 Ready) / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=cluster-diagnostic-checks
                    controller-uid=1285d828-698e-4e7d-b03d-ac819e793024
                    job-name=cluster-diagnostic-checks-job
  Service Account:  cluster-diagnostic-checkssa
  Containers:
   cluster-diagnostic-checks-container:
    Image:      mcr.microsoft.com/azurearck8s/clusterdiagnosticchecks:v0.1.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      /cluster_diagnostic_checks_job_script.sh
    Args:
      None
      None
      None
      eastus
      AZUREPUBLICCLOUD
    Limits:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Requests:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  10s   job-controller  Created pod: cluster-diagnostic-checks-job-dkql8
And here is the describe output for the pod it created:
Name:             cluster-diagnostic-checks-job-dkql8
Namespace:        azure-arc-release
Priority:         0
Service Account:  cluster-diagnostic-checkssa
Node:             <none>
Labels:           app=cluster-diagnostic-checks
                  controller-uid=1285d828-698e-4e7d-b03d-ac819e793024
                  job-name=cluster-diagnostic-checks-job
Annotations:      <none>
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    Job/cluster-diagnostic-checks-job
Containers:
  cluster-diagnostic-checks-container:
    Image:      mcr.microsoft.com/azurearck8s/clusterdiagnosticchecks:v0.1.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      /cluster_diagnostic_checks_job_script.sh
    Args:
      None
      None
      None
      eastus
      AZUREPUBLICCLOUD
    Limits:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Requests:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5gxkd (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-5gxkd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 kubernetes.io/arch=amd64:NoSchedule
                             kubernetes.io/arch=arm64:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age  From                                   Message
  ----     ------            ---- ----                                   -------
  Warning  FailedScheduling  16s  gke.io/optimize-utilization-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1684257394}, 2 Insufficient cpu, 2 Insufficient memory. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod.
  Normal   TriggeredScaleUp  11s  cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/subscripify/zones/us-central1-a/instanceGroups/gk3-autopilot-cluster-1-pool-1-3cb7bde1-grp 0->1 (max: 1000)}]
Unfortunately, the container does not produce any logs whatsoever.
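For what it's worth, this is roughly what I've been running to watch the job and try to pull logs (namespace and job name taken from the output above):

    # Watch the pod, try to pull logs from the job's container, and list recent events in the namespace
    kubectl get pods -n azure-arc-release -w
    kubectl logs -n azure-arc-release job/cluster-diagnostic-checks-job
    kubectl get events -n azure-arc-release --sort-by=.lastTimestamp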
I don't think this is a resource problem; I am looking at the resource quota limits in Google Cloud here (https://console.cloud.google.com/iam-admin/quotas?project=my-project) and they seem adequate, but I am a little less experienced with Google Cloud than I am with Azure. Has anyone tried this (specifically Azure Arc connected to a GKE Autopilot cluster) and been successful? If so, can you offer a little nudge in the right direction?
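For completeness, this is roughly how I've been double-checking the regional quota from the CLI as well (region inferred from the us-central1-a zone in the scale-up event; the project name is a placeholder):

    # Lists quota metrics (CPUs, in-use addresses, etc.) with their limits and current usage for the region
    gcloud compute regions describe us-central1 --project my-project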