
I want to set up Azure Arc on a Google Cloud GKE Autopilot cluster so I can manage its Kubernetes resources from Azure. This is both my first GKE cluster and my first Azure Arc connection. I am following the quickstart here (https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster?tabs=azure-cli#prerequisites). I have an active GKE cluster, and there is an Azure CLI command that establishes the link AND deploys resources via Helm to my GKE cluster (which is set as the default kubectl context).
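For context, the connect step I am running is essentially the quickstart command; the resource group and cluster names below are placeholders for my actual values:

```shell
# Placeholder names -- substitute your own resource group / cluster name.
# This registers the cluster from the current kubectl context with Azure Arc
# and deploys the Arc agents (including this diagnostic job) via Helm.
az connectedk8s connect \
  --name my-gke-cluster \
  --resource-group my-arc-rg \
  --location eastus
```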

The job sent to my GKE cluster always fails. Here is the describe output for the job on my cluster (I grabbed it while it was still running):
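(For reference, the output below was captured with commands along these lines, using the namespace shown in the output:)

```shell
# Namespace and job name taken from the Helm release shown below.
kubectl describe job cluster-diagnostic-checks-job -n azure-arc-release

# And for the pod created by the job (the pod name varies per run,
# so select it by the job-name label instead):
kubectl describe pod -n azure-arc-release -l job-name=cluster-diagnostic-checks-job
```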

Name:             cluster-diagnostic-checks-job
Namespace:        azure-arc-release
Selector:         controller-uid=1285d828-698e-4e7d-b03d-ac819e793024
Labels:           app=cluster-diagnostic-checks
                  app.kubernetes.io/managed-by=Helm
Annotations:      autopilot.gke.io/resource-adjustment:
                    {"input":{"containers":[{"name":"cluster-diagnostic-checks-container"}]},"output":{"containers":[{"limits":{"cpu":"500m","ephemeral-storag...
                  batch.kubernetes.io/job-tracking: 
                  meta.helm.sh/release-name: cluster-diagnostic-checks
                  meta.helm.sh/release-namespace: azure-arc-release
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Tue, 16 May 2023 10:17:09 -0700
Pods Statuses:    1 Active (0 Ready) / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=cluster-diagnostic-checks
                    controller-uid=1285d828-698e-4e7d-b03d-ac819e793024
                    job-name=cluster-diagnostic-checks-job
  Service Account:  cluster-diagnostic-checkssa
  Containers:
   cluster-diagnostic-checks-container:
    Image:      mcr.microsoft.com/azurearck8s/clusterdiagnosticchecks:v0.1.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      /cluster_diagnostic_checks_job_script.sh
    Args:
      None
      None
      None
      eastus
      AZUREPUBLICCLOUD
    Limits:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Requests:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Environment:          <none>
    Mounts:               <none>
  Volumes:                <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  10s   job-controller  Created pod: cluster-diagnostic-checks-job-dkql8

Here is the describe output for the pod:

Name:             cluster-diagnostic-checks-job-dkql8
Namespace:        azure-arc-release
Priority:         0
Service Account:  cluster-diagnostic-checkssa
Node:             <none>
Labels:           app=cluster-diagnostic-checks
                  controller-uid=1285d828-698e-4e7d-b03d-ac819e793024
                  job-name=cluster-diagnostic-checks-job
Annotations:      <none>
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:               
IPs:              <none>
Controlled By:    Job/cluster-diagnostic-checks-job
Containers:
  cluster-diagnostic-checks-container:
    Image:      mcr.microsoft.com/azurearck8s/clusterdiagnosticchecks:v0.1.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      /cluster_diagnostic_checks_job_script.sh
    Args:
      None
      None
      None
      eastus
      AZUREPUBLICCLOUD
    Limits:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Requests:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Environment:          <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5gxkd (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-5gxkd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 kubernetes.io/arch=amd64:NoSchedule
                             kubernetes.io/arch=arm64:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From                                   Message
  ----     ------            ----  ----                                   -------
  Warning  FailedScheduling  16s   gke.io/optimize-utilization-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1684257394}, 2 Insufficient cpu, 2 Insufficient memory. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod.
  Normal   TriggeredScaleUp  11s   cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/subscripify/zones/us-central1-a/instanceGroups/gk3-autopilot-cluster-1-pool-1-3cb7bde1-grp 0->1 (max: 1000)}]

Unfortunately, the container does not produce any logs whatsoever.
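Since the pod never gets past Pending (it is never scheduled onto a node), checks along these lines come back empty:

```shell
# Returns nothing -- the container never starts, so there is nothing to log.
kubectl logs -n azure-arc-release -l job-name=cluster-diagnostic-checks-job

# Only shows the FailedScheduling / TriggeredScaleUp events quoted above.
kubectl get events -n azure-arc-release --sort-by=.lastTimestamp
```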

I don't think this is a resource problem: I am looking at the resource quota limits on Google Cloud here (https://console.cloud.google.com/iam-admin/quotas?project=my-project) and they seem adequate, but I am a little less experienced with Google Cloud than I am with Azure. Has anyone out there tried this (specifically, Azure Arc connected to a GKE Autopilot cluster) and been successful? If so, can you offer a little nudge in the right direction?
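For completeness, the equivalent check from the CLI would be something like the following (the project ID is a placeholder; the region comes from the instance group URL in the scale-up event above):

```shell
# Lists regional quotas (CPUS, IN_USE_ADDRESSES, etc.) with their usage
# for the region the Autopilot cluster's nodes are created in.
gcloud compute regions describe us-central1 \
  --project my-project \
  --format="value(quotas)"
```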

williamohara
  • You may just need to give it a few minutes after running the deployment. It looks like autoscaling has been triggered. – Gari Singh May 12 '23 at 07:41
  • Can I tell an autopilot cluster to add nodes, @HarshManvar? – williamohara May 13 '23 at 13:19
  • I tried letting it sit for half a day, @GariSingh; the job just stops. – williamohara May 13 '23 at 13:22
  • Sorry, I totally missed that it's Autopilot. – Harsh Manvar May 13 '23 at 14:12
  • Hmm... can you output the logs from the container itself? It looks like a node scale-up is being triggered. Can you also describe the deployment itself? – Gari Singh May 14 '23 at 08:24
  • @GariSingh, it's actually a Job submitted by the Azure CLI through its custom Helm implementation. I am trying to see if I can get a look at the Helm charts for it. I was able to describe the job before it shuts itself down; I edited my initial post with the output. – williamohara May 16 '23 at 17:21
  • Also, @GariSingh, I had to run it again to get the extra info, so the pod describe is a little different than the first version. – williamohara May 16 '23 at 17:29

0 Answers