
Context

I want to train a custom model using YOLOv8. I've got it working on my local machine, but it is very slow, so I want to run the job on Azure Machine Learning for efficiency. I am using Azure ML SDK v2.

Issue

When I run on Azure ML, I get an error saying that YOLO cannot locate my training images.

Traceback (most recent call last):
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/ultralytics/yolo/engine/trainer.py", line 125, in __init__
  self.data = check_det_dataset(self.args.data)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/ultralytics/yolo/data/utils.py", line 243, in check_det_dataset
  raise FileNotFoundError(msg)
FileNotFoundError: 
Dataset 'custom.yaml' not found ⚠️, missing paths ['/mnt/azureml/cr/j/18bdc3371eca4975a0c4a7123f9adaec/exe/wd/valid/images']

Code / analysis

Here is the code I use to run the job:

command_job = command(
    display_name='Test Run 1',
    code="./src/",
    command="yolo detect train data=custom.yaml model=yolov8n.pt epochs=1 imgsz=1280 seed=42",
    environment="my-custom-env:3",
    compute=compute_target
)
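For reference, I submit this job with the SDK v2 client, roughly like the snippet below (ml_client is my authenticated MLClient for the workspace):

# ml_client is an authenticated azure.ai.ml.MLClient for the workspace
returned_job = ml_client.jobs.create_or_update(command_job)
print(returned_job.studio_url)  # link to the run in Azure ML Studio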

On my local machine (using Visual Studio Code), the custom.yaml file is in the ./src/ directory. When I run the job above, custom.yaml is uploaded and appears in the Code section of the job (viewed in Azure ML Studio). From investigating, I believe this is the compute working directory, which has the path:

'/mnt/azureml/cr/j/18bdc3371eca4975a0c4a7123f9adaec/exe/wd/'

My custom.yaml looks like this:

path: ../
train: train/images
val: valid/images

nc: 1
names: ["bike"]

So what is happening is that YOLO reads my custom.yaml, treats the working directory as the root path, and then tries to find valid/images within that directory:

'/mnt/azureml/cr/j/18bdc3371eca4975a0c4a7123f9adaec/exe/wd/valid/images'

My images are in my Datastore, not that directory, hence the error.

What I have tried - updating custom.yaml path

All my data (train and valid) is stored in Azure Blob Storage. In Azure ML Studio I have created a Datastore and added my data as a Dataset (it references my Azure Blob Storage account). My file structure is:

Dataset/
   - Train/
        - Images
        - Labels
   - Valid/
        - Images
        - Labels

Within my custom.yaml file I have tried replacing path with the following:

 **Storage URI**: https://mystorageaccount.blob.core.windows.net/my-datasets
 **Datastore URI**: azureml://subscriptions/XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourcegroups/my-rg/workspaces/my_workspace/datastores/my_datastore/paths/Dataset/

If I do this I get the same error; this time the URI is appended to the end of the working directory path. Example:

    '/mnt/azureml/cr/j/18bdc3371eca4975a0c4a7123f9adaec/exe/wd/https://mystorageaccount.blob.core.windows.net/my-datasets/valid/images'

What I have tried - mounting / downloading the dataset

I've read the Microsoft docs (e.g. here and here), and they say things like:

For most scenarios, you'll use URIs (uri_folder and uri_file) - a location in storage that can be easily mapped to the filesystem of a compute node in a job by either mounting or downloading the storage to the node.

It feels like I should be mapping my data (in my Datastore) to the compute filesystem. Then I could use that path in my custom.yaml. The documentation is not clear on how to do that.

In brief: how do I set up my data on Azure ML so that the path in my custom.yaml points to the data?

Alex P

1 Answer


A solution is to create a folder data asset with a path of the form azureml://datastores/<data_store_name>/paths/<dataset-path> and pass it as an input to your AzureML job. AzureML jobs resolve the path of uri_folder inputs at runtime, so custom.yaml can be updated programmatically to contain this path.

Here is an example of an AzureML job implementing this solution:

from azure.ai.ml import command
from azure.ai.ml import Input

command_job = command(
    inputs=dict(
        data=Input(
            type="uri_folder",
            # Reference to the registered data asset, e.g. azureml:bike-dataset:1 (use your own name and version)
            path="azureml:your-data-asset:version-number",
        )
    ),
    command="""
    echo "The data asset path is ${{ inputs.data }}" &&
    # Update custom.yaml to contain the correct path
    sed -i "s|path:.*$|path: ${{ inputs.data }}|" custom.yaml &&
    # Now custom.yaml contains the correct path so we can run the training
    yolo detect train data=custom.yaml model=yolov8n.pt epochs=1 imgsz=1280 seed=42 project=your-experiment name=experiment
    """,
    code="./src/",
    environment="your-environment",
    compute="your-compute-target",
    experiment_name="your-experiment",
    display_name="your-display-name",
)
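After the sed substitution runs inside the job, custom.yaml contains the runtime-resolved input path instead of the relative path. Roughly, it ends up looking like the sketch below (the path value is a placeholder; Azure ML generates the real mount or download path per job):

path: <runtime-resolved-value-of-${{ inputs.data }}>
train: train/images
val: valid/images

nc: 1
names: ["bike"]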

Note that you need the following package versions installed to make sure your model, parameters, and metrics are logged with MLflow:

ultralytics==8.0.133
azureml-mlflow==1.52.0
mlflow==2.4.2
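
If you bake these into a custom environment, a minimal conda specification could look like the sketch below (the environment name, channel, and Python version are assumptions; only the pinned packages come from this answer):

name: yolov8-azureml
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - ultralytics==8.0.133
      - azureml-mlflow==1.52.0
      - mlflow==2.4.2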

Edit: I have published a blog post explaining all the steps to run a YOLOv8 training with AzureML.

In the blog post I create the AzureML data asset from a local folder. In your case the dataset is already stored in a datastore, so you need to specify a path of the form azureml://datastores/<data_store_name>/paths/<dataset-path> instead of a local path when you create the data asset.
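
For example, registering such a data asset with the Python SDK v2 could look roughly like this (the asset name, version, workspace details, and datastore path are placeholders for your own values):

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to the workspace (subscription, resource group, and workspace name are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Register a uri_folder data asset pointing at the folder that already exists in the datastore
data_asset = Data(
    name="your-data-asset",  # matches the name used in the job's Input path
    version="1",
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/<data_store_name>/paths/<dataset-path>",
    description="YOLOv8 training data already stored in the datastore",
)
ml_client.data.create_or_update(data_asset)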

ouphi
    @AlexP does this solution work for you? If it does, kindly accept the answer, as it helps you, the one who provided the answer, and future readers. – Timbus Calin Apr 19 '23 at 10:59
  • This works for me, but unfortunately I get no output, neither the model nor other logs; the job succeeds without any output written. I actually copy-pasted your code and it's working flawlessly, but it doesn't save anything :( – Timbus Calin Jun 29 '23 at 13:00
  • @TimbusCalin Thanks for trying the solution. It was tested with version 8.0.83, and it looks like this solution does not work well with the latest yolov8 version. A quick workaround to see the logs & models with the latest yolov8 version is to add `project=logs name=somename` to your `yolo detect train ...` command. yolov8 saves the results in the {project}/{name} folder, and every file saved in the `logs` folder is available in the AzureML job history. – ouphi Jul 04 '23 at 23:01
  • Let me try to get things right: after the training is done, I should run the yolo detect command on the output? Could you be more specific? Do you mean in the same job, or in another job which receives the results (the output of the training job) as input? – Timbus Calin Jul 07 '23 at 12:17
  • @TimbusCalin I mean in the same job: you need to add the options `project=logs name=name_of_your_choice` to the yolo command. So you need to replace ```yolo detect train data=custom.yaml model=yolov8n.pt epochs=1 imgsz=1280 seed=42``` with ```yolo detect train data=custom.yaml model=yolov8n.pt epochs=1 imgsz=1280 seed=42 project=logs name=name_of_your_choice```, and then the logs and the models should appear in `logs/name_of_your_choice`. – ouphi Jul 10 '23 at 09:31
  • Thank you, I will try that. In that case, I will load the best.pt (I replaced the sed with input strings in the command job and adapted your code a bit) and then do as you said from the Python SDK, expecting the logs to appear. – Timbus Calin Jul 10 '23 at 11:18
  • @TimbusCalin I had a closer look at the issue; it looks like the mlflow integration broke. The fix is using the latest mlflow versions: `azureml-mlflow==1.52.0 mlflow==2.4.2`. Note that with the current yolov8 version you need to have `project=your-experiment` matching your experiment name to make sure your mlflow metrics and models end up in your experiment. But with the next yolov8 release it should not be required (this fix https://github.com/ultralytics/ultralytics/pull/3668 picks up the MLFLOW_EXPERIMENT_NAME env var that is automatically set by azureml). – ouphi Jul 12 '23 at 06:47
  • I've been working on different tasks on my job and have just seen your comment. Will take that into account. – Timbus Calin Jul 12 '23 at 06:49
  • For me this solution doesn't work; I get this error: mlflow.exceptions.MlflowException: Cannot start run with ID 25bfd0ff-a19e-4520-8059-831af4834fe2 because active run ID does not match environment run ID. Make sure --experiment-name or --experiment-id matches experiment set with set_experiment(), or just use command-line arguments – Timbus Calin Jul 13 '23 at 09:51
  • If I do not use command-line arguments, then I need to update project=your_experiment to match the name set with mlflow.set_experiment_name("this_is_the_experiment_name") – Timbus Calin Jul 13 '23 at 09:53
  • @TimbusCalin With the latest yolov8 version 8.0.133 you do not need to specify the mlflow experiment name or a project option; it automatically picks up an env var available in your azureml job. You only need to make sure you have mlflow and azureml-mlflow installed in your environment. – ouphi Jul 13 '23 at 11:12
  • Mine was 8.0.132. Will check asap! – Timbus Calin Jul 13 '23 at 11:42
  • It indeed works with the latest version! – Timbus Calin Jul 13 '23 at 14:15