5

In a training script executed on a compute target, we're trying to download a registered Dataset from an ADLS Gen2 Datastore. The problem is that it takes hours to download ~1.5 GB (split into ~8,500 files) to the compute target with the following method:

from azureml.core import Datastore, Dataset, Run, Workspace

# Retrieve the run context to get Workspace
RUN = Run.get_context(allow_offline=True)

# Retrieve the workspace
ws = RUN.experiment.workspace

# Creating the Dataset object based on a registered Dataset
dataset = Dataset.get_by_name(ws, name='my_dataset_registered')

# Download the Dataset locally
dataset.download(target_path='/tmp/data', overwrite=False)

Important note: the Dataset is registered to a path in the Data Lake that contains a lot of subfolders (and sub-subfolders, etc.), all containing small files of around 170 KB.

Note: I'm able to download the complete dataset to my local computer within a few minutes using azcopy or Storage Explorer. Also, the Dataset is defined at the folder level with the ** wildcard to scan subfolders: datalake/relative/path/to/folder/**
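For completeness, this is roughly how the Dataset was registered (the datastore name and path below are placeholders, not the actual ones used):

from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()

# Hypothetical ADLS Gen2 datastore name
datastore = Datastore.get(ws, 'my_adls2_datastore')

# FileDataset defined at the folder level; the ** wildcard includes all subfolders
dataset = Dataset.File.from_files(path=(datastore, 'relative/path/to/folder/**'))
dataset.register(workspace=ws, name='my_dataset_registered', create_new_version=True)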

Is this a known issue? How can I improve the transfer speed?

Thanks!

dernat71
  • From an answer by @Monica Kei, who cannot comment: Could you mention what versions of the azureml-core and azureml-dataprep SDK you have installed on your compute instance? And what VM size? I'm trying a similar scenario to what you're describing. When downloading from a local notebook, it takes about 10 minutes to download 10,000 small files. But on a compute instance, it took about 1.5 hours and then errored out. Trying to figure out what your setup is and narrow down the problem. – dan1st Mar 11 '20 at 06:09

2 Answers

3

Edited to be more answer-like:

It'd be helpful to include: what versions of the azureml-core and azureml-dataprep SDK you are using, what type of VM you are running as the compute instance, and what types of files (e.g. jpg? txt?) your dataset contains. Also, what are you trying to achieve by downloading the complete dataset to your compute?

Currently, the compute instance image comes with azureml-core 1.0.83 and azureml-dataprep 1.1.35 pre-installed, which are 1-2 months old. You might be using even older versions. You can try upgrading by running this in your notebook:

%pip install -U azureml-sdk
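After restarting the kernel, a quick sketch to confirm which versions are actually installed (azureml.core exposes VERSION; for azureml-dataprep, pip show works in a notebook cell):

import azureml.core
print('azureml-core:', azureml.core.VERSION)

%pip show azureml-dataprep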

If you don't see any improvement in your scenario, you can file an issue on the official docs page, such as the reference page for FileDataset, to get someone to help debug it.

(edited on June 9, 2020 to remove mention of experimental release because that is not happening anymore)

Monica Kei
  • I've added this as a comment for you :) – dan1st Mar 11 '20 at 06:09
  • Thanks for the advice :) – Monica Kei Mar 11 '20 at 17:46
  • Hi @MonicaKei, thanks a lot for the advice (coming back a bit late on this). Updating azureml-core from 1.6.0 to 1.7.0 definitely helped a lot in terms of downloading behavior and speed! To answer your questions: I'm downloading .parquet files from ADLS2 to allow Data Scientists to perform some EDA interactively. Would you have a better pattern to follow for interactive EDA? :-) – dernat71 Jun 08 '20 at 19:13
  • 1
    Hi @Nethim, actually in the last few months, there have been more discoveries about performance issues when downloading/mounting datasets comprised of a large number of small files. There has been (and still is) work to improve this! For interactive EDA, have you explored [Azure Machine Learning Studio](https://ml.azure.com)? – Monica Kei Jun 10 '20 at 00:49
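As a rough sketch of the mounting alternative mentioned in the comments above (assuming a FileDataset and a Linux-based compute target), mounting instead of downloading looks like this:

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name='my_dataset_registered')

# Mount the files instead of copying them; supported on Linux compute targets
mount_context = dataset.mount('/tmp/data')
mount_context.start()
# ... run EDA against the parquet files under /tmp/data ...
mount_context.stop()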
-1

DataTransferStep creates an Azure ML Pipeline step that transfers data between storage options, such as Azure Blob Storage and Azure Data Lake Storage.

Please see the documentation for the DataTransferStep class: https://learn.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.data_transfer_step.datatransferstep?view=azure-ml-py
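A rough sketch of how DataTransferStep could be used to copy the data between datastores (all names below are placeholders, and an Azure Data Factory compute is required to run the step):

from azureml.core import Datastore, Workspace
from azureml.core.compute import DataFactoryCompute
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import DataTransferStep

ws = Workspace.from_config()

# Hypothetical compute and datastore names
data_factory = DataFactoryCompute(ws, 'my-adf-compute')
adls_datastore = Datastore.get(ws, 'my_adls2_datastore')
blob_datastore = Datastore.get(ws, 'workspaceblobstore')

source_ref = DataReference(datastore=adls_datastore,
                           data_reference_name='source_data',
                           path_on_datastore='relative/path/to/folder')
dest_ref = DataReference(datastore=blob_datastore,
                         data_reference_name='dest_data',
                         path_on_datastore='copied_data')

transfer_step = DataTransferStep(name='transfer_adls_to_blob',
                                 source_data_reference=source_ref,
                                 destination_data_reference=dest_ref,
                                 compute_target=data_factory)

pipeline = Pipeline(workspace=ws, steps=[transfer_step])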

Ram