I am running an AzureML job on an attached Kubernetes compute cluster, using a custom instance type with a resource limit of 2 GPUs.
When I trigger the job, only 1 GPU is available because other jobs are using the rest. I want the job to be queued and to start once 2 GPUs become available, but instead I see the following error in the job Tags:
retry-reason-1 : 03/08/2023 10:45:05 +00:00, FailureMsg: PodPattern matched: {"reason":"UnexpectedAdmissionError","message":"Pod Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 2, Available: 1, which is unexpected"}, FailureCode: -1006
The job retries 10 times and then fails. Is there a way to change this behavior, for example by setting a maximum waiting time so that the job stays queued longer instead of failing so quickly?
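(The retry-reason tags above are also visible from the CLI with something roughly like the following, where the resource group and workspace names are placeholders for my own:)

az ml job show --name <job-name> --resource-group <my-resource-group> --workspace-name <my-workspace> --query tags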
I trigger the job with the az CLI:
az ml job create -f myjob.yaml
And my job definition looks like this:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
experiment_name: my-experiment
command: |
  python myscript.py
code: .
environment: azureml:my-environment:1
compute: azureml:my-onprem-compute
resources:
  instance_type: myinstancetypewith2gpus
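For reference, the custom instance type on the cluster is defined roughly like this (a sketch of the InstanceType custom resource used by the AzureML Kubernetes extension; the apiVersion and the CPU/memory values below are assumptions from my setup, the relevant part is the nvidia.com/gpu limit of 2):

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: myinstancetypewith2gpus
spec:
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
    limits:
      cpu: "4"
      memory: 16Gi
      nvidia.com/gpu: 2   # the job requests 2 GPUs through this instance type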