
What would be the recommended way to work with a FileDataset on AmlCompute, when submitting an Estimator-based run (with Docker enabled)?

My FileDataset is around 1.5 GB and contains a few thousand images.
I also have a TabularDataset with references to images in that FileDataset. This tabular dataset contains the classes, or references to other (mask) images, depending on the model I'm trying to train.

So, in order to load the images into memory (as np.arrays), I have to read each image from its file location, based on the file name in my TabularDataset.
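Roughly, the loading step looks like this (a minimal sketch; the column name, mount root, and helper names are made up for illustration):

```python
import os

import numpy as np
import pandas as pd


def resolve_paths(df, filename_col, root):
    """Map the tabular dataset's file-name column to absolute paths under root."""
    return [os.path.join(root, name) for name in df[filename_col]]


def load_images(paths):
    """Read each resolved path into a np.array."""
    from PIL import Image  # assumed available in the training environment
    return [np.asarray(Image.open(p)) for p in paths]


# Example: root would be the mount point or download directory
df = pd.DataFrame({"filename": ["a.png", "b.png"], "label": [0, 1]})
paths = resolve_paths(df, "filename", "/mnt/images")
```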

At this point, I see two options, but neither is feasible, as each takes ages (over an hour) to complete, and that's just not workable:

Option 1: mount the file dataset

image_dataset = ws.datasets['imagedata']
mounted_images = image_dataset.mount()
mounted_images.start()
print('Data set mounted', datetime.datetime.now())

load_image(mounted_images.mount_point + '/myfilename.png')

Option 2: download the dataset

image_dataset = ws.datasets['chart-imagedata']
image_dataset.download(target_path='chartimages', overwrite=False)

I want the fastest possible way to launch an Estimator on AmlCompute and get access to the files as quickly and easily as possible.
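One variant I'm considering (not yet verified to be faster) is passing the FileDataset to the estimator as a named input, so the run infrastructure handles the mount before train.py starts, instead of mounting inside the script. Sketch, reusing the dataset and cluster names from above:

```python
from azureml.train.dnn import TensorFlow

# Hand the FileDataset to the estimator as an input; AmlCompute mounts it
# before the entry script runs.
image_dataset = ws.datasets['chart-imagedata']

estimator = TensorFlow(
    source_directory=training_name,
    compute_target=gpu_cluster,
    entry_script='train.py',
    inputs=[image_dataset.as_named_input('images').as_mount()],
    framework_version='2.0',
    use_docker=True,
)

# Inside train.py, the mount path is then available as:
#   mount_path = Run.get_context().input_datasets['images']
```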

I had a look at this post on Stack Overflow, which suggested updating the azureml SDK packages in the train.py script. I applied that, but it made no difference.

EDITED (more information):

  • Data source is Azure Blob Storage (storage account has ADLS 2.0 enabled)
  • Compute target is a cluster of 0-4 nodes (only 1 node in use) of size STANDARD_D2_V2

The train.py I am using (just for repro purposes):

# Force latest prerelease version of certain packages
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "--pre", package])

install('azureml-core')
install('azureml-sdk')

# General references
import argparse
import os
import numpy as np
import pandas as pd
import datetime

from azureml.core import Workspace, Dataset, Datastore, Run, Experiment

import time

ws = Run.get_context().experiment.workspace

# Download file data set
print('Downloading data set', datetime.datetime.now())
image_dataset = ws.datasets['chart-imagedata']
image_dataset.download(target_path='chartimages', overwrite=False)
print('Data set downloaded', datetime.datetime.now())


# mount file data set
print('Mounting data set', datetime.datetime.now())
image_dataset = ws.datasets['chart-imagedata']
mounted_images = image_dataset.mount()
mounted_images.start()
print('Data set mounted', datetime.datetime.now())

print('Training finished')

And I'm using a TensorFlow Estimator:

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import TensorFlow


# Choose a name for your compute cluster
gpu_cluster_name = "g-train-cluster"

# Verify that cluster does not exist already
try:
    gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4, min_nodes=0)
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    print('Creating new cluster')

constructor_parameters = {
    'source_directory':training_name,
    'script_params':script_parameters,
    'compute_target':gpu_cluster,
    'entry_script':'train.py',
    'pip_requirements_file':'requirements.txt', 
    'use_gpu':True,
    'framework_version': '2.0',
    'use_docker':True}

estimator = TensorFlow(**constructor_parameters)
run = self.__experiment.submit(estimator)
Sam Vanhoutte
  • can you edit your post to include: 1) the data source of the FileDataset (blob ADLS2?), 2) the size of your compute target, and 3) more info on the code you're sending to the estimator? – Anders Swanson Jun 07 '20 at 17:36
  • thanks for the feedback. May I know the AzureML SDK version you are using? You can print the version in your script: print("Azure ML SDK Version: ", azureml.core.VERSION) – May Hu Jun 08 '20 at 16:27
  • Hello May, the version of the azureml SDK on the client side (where I submit the run) is 1.5.0, but when I output the version on the AmlCompute (from my train.py script), it's 1.6.0. – Sam Vanhoutte Jun 08 '20 at 18:11
  • unsolicited feedback, but I recommend that you create an `Environment` and `CondaDependencies` object to define your pip dependencies and [pip option of `'--pre'`](https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.conda_dependencies.condadependencies?view=azure-ml-py#set-pip-option-pip-option-). The benefits will be 1) train.py finishes faster and 2) environment definition doesn't happen inside of the Estimator but rather the control plane. – Anders Swanson Jun 10 '20 at 18:32
  • [Here's an example](https://github.com/Azure/MachineLearningNotebooks/blob/6d11cdfa0a8869775a29cd819d9b479978e39814/how-to-use-azureml/work-with-data/datasets-tutorial/pipeline-with-datasets/pipeline-for-image-classification.ipynb) of setting `pip_option` with `CondaDependencies` – Anders Swanson Jun 10 '20 at 18:38
  • Thanks, Anders. Valid feedback and I've taken that up by now. I've also been in contact with the product group directly, as it seems this behavior is not normal. Will update once I know more. – Sam Vanhoutte Jun 11 '20 at 10:38
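For completeness, the Environment/CondaDependencies setup suggested in the comments would look roughly like this (a sketch; the environment name and package list are illustrative, not taken from my actual setup):

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Define pip dependencies on the control plane instead of upgrading
# packages inside train.py at run time.
conda_deps = CondaDependencies()
conda_deps.set_pip_option('--pre')  # allow prerelease packages
for pkg in ['azureml-core', 'azureml-sdk']:
    conda_deps.add_pip_package(pkg)

env = Environment(name='train-env')
env.python.conda_dependencies = conda_deps
# env can then be attached to the estimator/run configuration, so the
# image is built once and train.py starts without any pip installs.
```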

0 Answers