
I'm getting started with creating a tuned model. I've got my training data in a .jsonl file, uploaded to a bucket, everything checks out. I've run the tuning 3 times and every time it fails on step 7/8.

com.google.cloud.ai.platform.common.errors.AiPlatformException: code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod, cause=null; Failed to create custom job.Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962; Failed to create external task or refresh its state. Task:Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962; Failed to handle the pipeline task. Task: Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962

I followed the steps here: Vertex AI pipeline quota, with no luck.

I searched the quotas, and for the quota listed in the error message it says I'm at 0%.

It also shows no quotas are over 90%.

The docs say that these pipelines only run in us-central1. When I inspect the quota for restricted_image_training_tpu_v3_pod, it says my quota is 0. I can request an increase to 1, but I would have thought the docs would mention that you can't get started without it.

Here's what the pipeline looks like: [screenshot of the pipeline]

santeko
  • Does this [link](https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-models#quota) help you? – kiran mathew May 23 '23 at 08:16
  • @kiranmathew I've read those docs, the confusing part is that it says you can only run the tuning model in us-central1 but the resources needed for that job are only available in europe-west4. If you can figure out how to configure it correctly, please let me know. – santeko May 24 '23 at 16:59
  • Based on my understanding you can give the tuning job location as "europe-west4" and the tuned model location as "us-central1". – kiran mathew May 30 '23 at 12:05
  • I'm having the EXACT same issue. Did you find a solution? I will try @kiranmathew's suggestion for now – user3689720 May 30 '23 at 22:14
  • @kiranmathew can you post steps with screenshots on how to do that? The Vertex AI UI doesn't let you configure much. Are you using the CLI instead? – santeko May 31 '23 at 16:08
  • Hi @santeko, have you tried in the `europe-west4` region? – kiran mathew Jun 08 '23 at 07:49
  • @kiranmathew have you read the docs that say you can only set up Vertex AI in us-central1? If you can show the steps to set it up with workarounds, with screenshots, please do so. – santeko Jun 14 '23 at 18:51
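
For readers trying the suggestion above (tuning job in `europe-west4`, tuned model in `us-central1`) outside the console UI, here is a minimal sketch using the Vertex AI SDK for Python. It assumes the `google-cloud-aiplatform` package is installed and that its preview `tune_model()` method accepts `tuning_job_location` and `tuned_model_location`, which may differ by SDK version. The model name `text-bison@001`, the step count, and the helper names `build_tuning_kwargs` and `run_tuning` are illustrative assumptions, not something confirmed in this thread.

```python
# Sketch only: assumes google-cloud-aiplatform is installed and that the
# preview tune_model() accepts the two location parameters used below.


def build_tuning_kwargs(training_data_uri: str, train_steps: int = 100) -> dict:
    """Collect the arguments that split the job region from the model region."""
    return {
        "training_data": training_data_uri,      # gs:// URI of the .jsonl file
        "train_steps": train_steps,              # assumed value, adjust as needed
        "tuning_job_location": "europe-west4",   # where the TPU quota applies
        "tuned_model_location": "us-central1",   # where the tuned model is deployed
    }


def run_tuning(project_id: str, training_data_uri: str) -> None:
    """Start a tuning job; needs valid Google Cloud credentials to run."""
    # Imported lazily so the helper above works without the SDK installed.
    import vertexai
    from vertexai.preview.language_models import TextGenerationModel

    vertexai.init(project=project_id, location="us-central1")
    model = TextGenerationModel.from_pretrained("text-bison@001")  # assumed model
    model.tune_model(**build_tuning_kwargs(training_data_uri))
```

Calling `run_tuning("my-project", "gs://my-bucket/train.jsonl")` with your own (here hypothetical) project ID and bucket path would submit the job. This only changes where the job runs; the TPU quota in that region still has to be granted.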

1 Answer


To add to kiran mathew's suggestion:

Since the model uses 64 cores of TPU v3, you can submit a quota increase request in multiples of 64 (e.g. 64 for 1 job, 128 for 2 concurrent jobs) under the "Restricted image training TPU V3 pod cores per region" quota.
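
As a quick illustration of that arithmetic, here is a small sketch that computes the value to request for a given number of concurrent tuning jobs. The 64-cores-per-job figure comes from the paragraph above; the function name `required_tpu_v3_cores` is just a hypothetical helper.

```python
CORES_PER_TUNING_JOB = 64  # TPU v3 cores used by one tuning job (from the answer)


def required_tpu_v3_cores(concurrent_jobs: int) -> int:
    """Return the quota value to request for the given number of concurrent jobs."""
    if concurrent_jobs < 1:
        raise ValueError("concurrent_jobs must be at least 1")
    return concurrent_jobs * CORES_PER_TUNING_JOB


for jobs in (1, 2):
    print(f"{jobs} concurrent job(s): request {required_tpu_v3_cores(jobs)} cores")
```

So a quota of 1, as mentioned in the question, would not be enough for even a single job under this assumption.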

Soleign H.
  • I reached out to GCP and they said they cannot increase the quota because the resource `TPU v3` isn't available in the region where the docs say you have to run Vertex AI (us-west); you can only increase the quota in europe-west4. Looking for someone who's figured out how to get it to run to share their steps. – santeko Jun 07 '23 at 03:40