Vertex AI Pipeline quota aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod

Question

I'm getting started with creating a tuned model. I've got my training data in a .jsonl file, uploaded to a bucket, everything checks out. I've run the tuning 3 times and every time it fails on step 7/8.

com.google.cloud.ai.platform.common.errors.AiPlatformException: code=RESOURCE_EXHAUSTED, message=The following quota metrics exceed quota limits: aiplatform.googleapis.com/restricted_image_training_tpu_v3_pod, cause=null; Failed to create custom job.Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962; Failed to create external task or refresh its state. Task:Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962; Failed to handle the pipeline task. Task: Project number: 643054741456, Job id: 7035574795022368768, Task id: -2160728365068189696, Task name: large-language-model-tuner, Task state: DRIVER_SUCCEEDED, Execution name: projects/643054741456/locations/europe-west4/metadataStores/default/executions/6209974609820216962

I followed the steps in the question "Vertex AI pipeline quota", with no luck.

I searched the quotas and for the quota listed in the error message, it says I'm at 0%.

It also shows no quotas are over 90%.

The docs say that these pipelines only run in us-central1. When I inspect the quota for restricted_image_training_tpu_v3_pod, it says my quota limit is 0. I can request an increase to 1, but I would have thought the docs would mention that you can't get started without that.

Here's what the pipeline looks like:

Does this [link](https://cloud.google.com/vertex-ai/docs/generative-ai/models/tune-models#quota) help you? — kiran mathew, May 23 '23 at 08:16
@kiranmathew I've read those docs, the confusing part is that it says you can only run the tuning model in us-central1 but the resources needed for that job are only available in europe-west4. If you can figure out how to configure it correctly, please let me know. — santeko, May 24 '23 at 16:59
Based on my understanding you can give the tuning job location as "europe-west4" and the tuned model location as "us-central1". — kiran mathew, May 30 '23 at 12:05
I'm having the EXACT same issue. Did you find a solution? I will try @kiranmathew's suggestion for now — user3689720, May 30 '23 at 22:14
@kiranmathew can you post steps with screenshots on how to do that? When using the vertex AI UI it doesn't let you configure much, are you using the cli instead? — santeko, May 31 '23 at 16:08
@kiranmathew have you read the docs that say you can only setup vertex ai in us-central1? If you can show with screenshots steps to set it up with workarounds, please do so. — santeko, Jun 14 '23 at 18:51
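
For anyone who wants to try the split-region suggestion from the comments without going through the UI, here is a minimal, untested sketch using the Vertex AI Python SDK. Assumptions: `tune_model` accepts `tuning_job_location` and `tuned_model_location`, as it did in the SDK around the time of this thread (on older SDK versions the import may be `vertexai.preview.language_models` instead); the model name `text-bison@001`, the project ID, the bucket path, and the helper names `build_tuning_kwargs` and `run_tuning` are placeholders made up for illustration.

```python
# Sketch only: the tuning job runs where the TPU v3 pod quota lives,
# while the tuned model is uploaded to the region the docs require.

def build_tuning_kwargs(training_data_uri: str, train_steps: int = 100) -> dict:
    """Collect the arguments for tune_model (parameter names are assumed)."""
    return {
        "training_data": training_data_uri,       # .jsonl file in a GCS bucket
        "train_steps": train_steps,               # arbitrary example value
        "tuning_job_location": "europe-west4",    # region from the error message
        "tuned_model_location": "us-central1",    # region named in the docs
    }


def run_tuning(project_id: str, training_data_uri: str):
    """Start a tuning job. Needs google-cloud-aiplatform and GCP credentials."""
    # Imported here so the helper above can be used without the SDK installed.
    import vertexai
    from vertexai.language_models import TextGenerationModel

    vertexai.init(project=project_id, location="us-central1")
    model = TextGenerationModel.from_pretrained("text-bison@001")
    return model.tune_model(**build_tuning_kwargs(training_data_uri))


if __name__ == "__main__":
    # Prints the arguments only; call run_tuning(...) yourself to submit a job.
    print(build_tuning_kwargs("gs://my-bucket/train.jsonl"))
```

The job still needs TPU v3 pod quota in europe-west4, so this does not remove the need for a quota increase there.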

score 0 · Answer 1 · answered Jun 02 '23 at 17:15


To add to kiran mathew's comment:

Since the model uses 64 cores of TPU v3, you can submit a quota increase request in multiples of 64 (e.g., 64 for 1 job, 128 for 2 concurrent jobs) under the "Restricted image training TPU V3 pod cores per region" quota.
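
The arithmetic can be sketched as follows. The 64-cores-per-job figure comes from the answer above; the function names are made up for illustration, and the idea that a limit of 0 is rejected because a single job already asks for 64 cores is my reading of the error, not something confirmed by Google.

```python
CORES_PER_TUNING_JOB = 64  # TPU v3 pod cores used by one tuning job (per the answer)


def required_cores(concurrent_jobs: int) -> int:
    """Quota value to request for a given number of concurrent tuning jobs."""
    if concurrent_jobs < 1:
        raise ValueError("concurrent_jobs must be at least 1")
    return CORES_PER_TUNING_JOB * concurrent_jobs


def fits_quota(quota_limit: int, cores_in_use: int, new_jobs: int = 1) -> bool:
    """True if starting new_jobs more jobs stays within the quota limit."""
    return cores_in_use + required_cores(new_jobs) <= quota_limit


for jobs in (1, 2):
    print(f"{jobs} concurrent job(s): request {required_cores(jobs)} cores")

# A limit of 0 shows 0% usage in the console, yet one job cannot fit.
print("fits with limit 0:", fits_quota(quota_limit=0, cores_in_use=0))
print("fits with limit 64:", fits_quota(quota_limit=64, cores_in_use=0))
```

This would also explain why requesting an increase to 1 would not help: the request has to cover at least 64 cores.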


Soleign H.


I reached out to GCP and they said they cannot increase the quota because the resource `TPU v3` isn't available in the region where the docs say you have to run Vertex AI (us-west); you can only increase the quota in europe-west4. I'm looking for someone who has figured out how to get it to run to share their steps. – santeko Jun 07 '23 at 03:40
