I started a free trial with Databricks and everything was running perfectly. The trial ended on the 28th April and I assume I was simply transferred to the normal premium paid plan. I last used my general cluster on the 2nd May. Since coming back from a week's holiday, I have been unable to restart my general compute cluster. I tried deleting this cluster and creating a new one, but the new cluster has been stuck in the state "Finding instances for new nodes, acquiring new instances if necessary" for nearly 2 hours.
My GCP quota for n2_cpus is currently below the required minimum. I have 24 and have requested an increase to 50, which my Google sales rep should be handling for me.
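For reference, this is roughly how I compare the quota against what I asked for. It is only a sketch: I am assuming the quota list has the shape returned in the `quotas` field of `gcloud compute regions describe <region> --format=json`, the `usage` value is a placeholder, and `quota_shortfall` is just a helper name I made up. The limit of 24 and the target of 50 are the numbers above.

```python
# Assumed shape of one entry in the "quotas" list of a region description.
# The limit (24) is my real quota; the usage value is a placeholder.
sample_quotas = [
    {"metric": "N2_CPUS", "limit": 24.0, "usage": 0.0},
]


def quota_shortfall(quotas, metric, needed):
    """Return (limit, shortfall) for a metric; shortfall is 0 if the limit is enough."""
    for entry in quotas:
        if entry["metric"] == metric:
            limit = entry["limit"]
            return limit, max(0.0, needed - limit)
    raise KeyError(f"metric {metric!r} not found in quota list")


# 50 is the value I requested, not a confirmed Databricks minimum.
limit, shortfall = quota_shortfall(sample_quotas, "N2_CPUS", needed=50)
print(f"N2_CPUS limit={limit:g}, short by {shortfall:g}")
```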
Interestingly, I noticed in the GCP Logs Explorer that my GKE cluster was deleted on the 7th May by the following request:
requestMetadata: { callerIp: "gce-internal-ip" callerSuppliedUserAgent: "databricks-api/1.0 Google-API-Java-Client/1.34.0 Google-HTTP-Java-Client/1.42.3 (gzip),gzip(gfe)" destinationAttributes: { } requestAttributes: { auth: {0} time: "2023-05-07T17:06:09.891630201Z" }}
The corresponding method name in my logs is:
"google.container.v1.ClusterManager.DeleteCluster"
I believe this is the reason I can no longer start or create clusters in Databricks. How is it possible that this GKE cluster was deleted? The deletion certainly did not come from us internally. I am thinking of creating a new workspace, but I worry that the same thing will happen again and cause a loss of data.
UPDATE
On closer inspection of the logs, the deletion of the GKE cluster was triggered by my Databricks service account. This means the deletion came from Databricks itself. Does anyone know why or how this might have happened? My Databricks workspace has not been deleted.
ANOTHER UPDATE
This is actually normal behaviour. Databricks deletes the GKE cluster after 5 days of inactivity to reduce costs. Normally, though, Databricks recreates the GKE cluster when a compute cluster is spun up again on the Databricks side. For some reason, whenever I try to start a compute cluster on Databricks, I get many error messages from GKE in Logs Explorer. Specifically, the Databricks service account is calling this method:
"google.container.v1.ClusterManager.CreateCluster"
Yet every call results in an "internal error" warning. This seems strange, since I had no issues creating the GKE cluster when I first created my Databricks workspace.
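In case it helps anyone reproduce this, here is a small sketch of how I build the Logs Explorer query to isolate these entries. I am assuming the standard audit log fields (`resource.type`, `protoPayload.methodName`, `severity`, `timestamp`); the start time is simply the day the GKE cluster was deleted, and `gke_audit_filter` is a helper name of my own.

```python
def gke_audit_filter(method, since, min_severity=None):
    """Build a Cloud Logging filter string for one GKE ClusterManager method."""
    parts = [
        'resource.type="gke_cluster"',
        f'protoPayload.methodName="{method}"',
        f'timestamp>="{since}"',
    ]
    if min_severity:
        # Only keep entries at or above this severity (e.g. WARNING, ERROR).
        parts.append(f"severity>={min_severity}")
    return " AND ".join(parts)


# Failed creation attempts since the day the GKE cluster was deleted.
create_filter = gke_audit_filter(
    "google.container.v1.ClusterManager.CreateCluster",
    "2023-05-07T00:00:00Z",
    min_severity="WARNING",
)
print(create_filter)
```

The resulting string can be pasted into the Logs Explorer query box or passed to `gcloud logging read`. Swapping the method name for `google.container.v1.ClusterManager.DeleteCluster` and dropping the severity argument should show the deletion request instead.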