During a training script executed on a compute target, we're trying to download a registered Dataset from an ADLS Gen2 Datastore. The problem is that it takes hours to download ~1.5 GB (split into ~8,500 files) to the compute target with the following method:
from azureml.core import Datastore, Dataset, Run, Workspace
# Retrieve the run context to get Workspace
RUN = Run.get_context(allow_offline=True)
# Retrieve the workspace
ws = RUN.experiment.workspace
# Creating the Dataset object based on a registered Dataset
dataset = Dataset.get_by_name(ws, name='my_dataset_registered')
# Download the Dataset locally
dataset.download(target_path='/tmp/data', overwrite=False)
Important note: the Dataset is registered to a path in the Data Lake that contains many subfolders (and sub-subfolders, etc.), each containing small files of around 170 KB.
Note: I'm able to download the complete dataset to my local computer within a few minutes using AzCopy or Storage Explorer. Also, the Dataset is defined at the folder level with the ** wildcard so that subfolders are scanned: datalake/relative/path/to/folder/**
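For completeness, the Dataset was registered roughly as follows (the datastore name below is a placeholder; the path is the one mentioned above):

from azureml.core import Datastore, Dataset, Workspace

ws = Workspace.from_config()
# Placeholder name for the registered ADLS Gen2 datastore
datastore = Datastore.get(ws, datastore_name='my_adls2_datastore')

# Folder-level definition with the ** wildcard so all subfolders are included
dataset = Dataset.File.from_files(path=(datastore, 'relative/path/to/folder/**'))
dataset.register(workspace=ws, name='my_dataset_registered')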
Is this a known issue? How can I improve the transfer speed?
Thanks!