What would be the recommended way to work with a FileDataset on AmlCompute, when submitting an Estimator-based run (with Docker enabled)?
My FileDataset is around 1.5 GB and contains a few thousand images.
I have a TabularDataset with references to images in that FileDataset. Depending on the model I'm trying to train, this tabular dataset contains either the classes or references to other (mask) images.
So, to load the images into memory (as np.arrays), I have to read each image from its file location, based on the file name in my TabularDataset.
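For context, the loading pattern looks roughly like this (a minimal sketch; tabular_dataset, image_root and the 'filename' column are placeholders for my actual code):
import os
import numpy as np
from PIL import Image

def load_image(path):
    # Read one image file from disk into a numpy array.
    return np.asarray(Image.open(path))

# tabular_dataset is my TabularDataset; 'filename' is a placeholder for the
# column that references files in the FileDataset, and image_root is wherever
# those files end up (mount point or download folder).
tabular_df = tabular_dataset.to_pandas_dataframe()
images = [load_image(os.path.join(image_root, name)) for name in tabular_df['filename']]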
At this point, I see two options, but neither is feasible: both take over an hour to complete, which is just not workable:
Option 1: mount the FileDataset
import datetime
image_dataset = ws.datasets['imagedata']   # ws is the Workspace
mounted_images = image_dataset.mount()     # returns a mount context
mounted_images.start()                     # mounts onto the local file system
print('Data set mounted', datetime.datetime.now())
load_image(mounted_images.mount_point + '/myfilename.png')
Option 2: download the FileDataset
image_dataset = ws.datasets['chart-imagedata']
image_dataset.download(target_path='chartimages', overwrite=False)  # copies every file locally
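A quick way to compare per-file read latency between the two (just a sketch; 'myfilename.png' is a placeholder file name):
import time

def time_read(path):
    # Time one raw file read to compare per-file latency.
    start = time.time()
    with open(path, 'rb') as f:
        data = f.read()
    print(path, len(data), 'bytes in', round(time.time() - start, 2), 's')

time_read(mounted_images.mount_point + '/myfilename.png')  # via the mount
time_read('chartimages/myfilename.png')                    # via the download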
I want the fastest possible way to launch an Estimator on AmlCompute and get access to the files as quickly and easily as possible.
I had a look at this post on Stack Overflow, which suggested updating the azureml SDK packages inside the train.py script. I applied that, but it made no difference.
EDITED (more information):
- Data source is Azure Blob Storage (the storage account has ADLS Gen2 enabled)
- Compute target: a cluster of 0-4 nodes (only 1 node in use) of size STANDARD_D2_V2
The train.py I am using (just for repro purposes):
# Force the latest prerelease version of certain packages
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "--pre", package])

install('azureml-core')
install('azureml-sdk')
# General references
import argparse
import os
import numpy as np
import pandas as pd
import datetime
import time
from azureml.core import Workspace, Dataset, Datastore, Run, Experiment
ws = Run.get_context().experiment.workspace
# Download file data set
print('Downloading data set', datetime.datetime.now())
image_dataset = ws.datasets['chart-imagedata']
image_dataset.download(target_path='chartimages', overwrite=False)
print('Data set downloaded', datetime.datetime.now())
# Mount file data set
print('Mounting data set', datetime.datetime.now())
image_dataset = ws.datasets['chart-imagedata']
mounted_images = image_dataset.mount()
mounted_images.start()
print('Data set mounted', datetime.datetime.now())
print('Training finished')
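One alternative I'm considering (based on my reading of the SDK docs; I haven't benchmarked it): pass the dataset to the estimator as a named input so that AmlCompute mounts it before train.py even starts (see the submission-side sketch after the estimator code below). The script side would then reduce to something like this, where 'chart_images' is a name I would pick myself:
run = Run.get_context()
# With the dataset passed as a named input, it is already mounted when the
# script starts; this just resolves the local path.
data_root = run.input_datasets['chart_images']
print('Files available under', data_root, datetime.datetime.now())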
And I'm using a TensorFlow Estimator:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import TensorFlow

# Choose a name for your training cluster
gpu_cluster_name = "g-train-cluster"

# Reuse the cluster if it already exists, otherwise create it
try:
    gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing cluster, using it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           min_nodes=0, max_nodes=4)
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    print('Creating new cluster')
constructor_parameters = {
    'source_directory': training_name,      # set elsewhere in my code
    'script_params': script_parameters,     # set elsewhere in my code
    'compute_target': gpu_cluster,
    'entry_script': 'train.py',
    'pip_requirements_file': 'requirements.txt',
    'use_gpu': True,                        # note: STANDARD_D2_V2 is a CPU-only SKU
    'framework_version': '2.0',
    'use_docker': True}
estimator = TensorFlow(**constructor_parameters)
run = self.__experiment.submit(estimator)   # this code lives inside a class
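And the submission side of that alternative would look roughly like this ('chart_images' matches the name in the script-side sketch above):
image_dataset = ws.datasets['chart-imagedata']
# as_named_input(...).as_mount() asks AmlCompute to mount the FileDataset
# onto the node before the entry script runs.
constructor_parameters['inputs'] = [image_dataset.as_named_input('chart_images').as_mount()]
estimator = TensorFlow(**constructor_parameters)
run = self.__experiment.submit(estimator)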