
I have created a File Dataset using the Azure ML Python API. The data in question is a set of parquet files (~10K files, each ~330 KB) residing in Azure Data Lake Gen 2, spread across multiple partitions. I then tried to mount the dataset on an AML compute instance. During the mounting process, I observed that each parquet file gets downloaded twice under the /tmp directory of the compute instance, with the following message printed in the console logs:

Downloaded path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<blob_path>/20211203.parquet is different from target path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<container_name>/<blob_path>/20211203.parquet

This log message is printed for every parquet file that is part of the dataset.

Also, mounting the dataset is very slow: 44 minutes for ~10K parquet files of ~330 KB each.

"%%time" command in the Jupyter Lab shows most of the time has been used for IO process?

CPU times: user 4min 22s, sys: 51.5 s, total: 5min 13s
Wall time: 44min 15s

Note: Both the Data Lake Gen 2 and Azure ML compute instance are under the same virtual network.

Here are my questions:

  1. How to avoid downloading the parquet file twice?
  2. How to make the mounting process faster?

I have gone through this thread, but the discussion there didn't reach a conclusion.

The Python code I have used is as follows:

import pandas as pd
from azureml.core import Dataset

# Create a FileDataset from the list of blob paths and register it
data = Dataset.File.from_files(path=list_of_blobs, validate=True)
dataset = data.register(workspace=ws, name=dataset_name, create_new_version=create_new_version)

mount_context = None
try:
    # Mount the file stream
    mount_context = dataset.mount(path_to_mount)
    mount_context.start()
except Exception as ex:
    raise ex

df = pd.read_parquet(path_to_mount)
Arnab Biswas

2 Answers


The robust option is to download directly from the AzureBlobDatastore. You need to know the datastore name and the relative path, which you can get by printing the dataset description. Namely:

import tempfile

import pandas as pd
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
dstore = ws.datastores.get(dstore_name)
target = (dstore, dstore_path)

with tempfile.TemporaryDirectory() as tmpdir:
    # Download all files under the datastore path to a temporary directory and read them
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df = pd.read_parquet(tmpdir)
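
As noted above, the datastore name and relative path can be read off the dataset description. A minimal sketch of that lookup, assuming the dataset was registered under dataset_name as in the question:

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
registered = Dataset.get_by_name(ws, name=dataset_name)
# Printing a FileDataset shows its source definition, which includes the
# datastore name and the relative blob paths it was created from.
print(registered)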

The convenient option is to stream tabular datasets. Note that you don't control how the files are read (Microsoft's converters may occasionally not behave as you expect). Here is the template:

ds = Dataset.Tabular.from_parquet_files(target)
df = ds.to_pandas_dataframe()
Maciej Skorski

I have executed a bunch of tests to compare the performance of FileDataset.mount() and FileDataset.download(). In my environment, download() is much faster than mount().
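
For anyone reproducing this comparison, here is a rough timing sketch under the same setup as the question; the local paths are illustrative and dataset is the registered FileDataset from the question:

import time

import pandas as pd

# Access mode 1: download every parquet file to local disk, then read.
t0 = time.perf_counter()
dataset.download(target_path="/tmp/ds_download", overwrite=True)
df_download = pd.read_parquet("/tmp/ds_download")
print(f"download + read: {time.perf_counter() - t0:.1f}s")

# Access mode 2: mount the dataset and stream the files through the mount point.
t0 = time.perf_counter()
mount_context = dataset.mount("/tmp/ds_mount")
mount_context.start()
try:
    df_mount = pd.read_parquet("/tmp/ds_mount")
finally:
    mount_context.stop()
print(f"mount + read: {time.perf_counter() - t0:.1f}s")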

download() works well when the disk size of the compute is large enough to fit all the files. However, in a multi-node environment, the same data (in my case parquet files) gets downloaded to each of the nodes (multiple copies). As per the documentation:

If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode will avoid the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode.

Downloading data in a multi-node environment could trigger performance issues (link). In such a case, mount() might be preferred.
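
For remote (including multi-node) runs, the access mode is chosen when the dataset is handed to the job. A sketch using ScriptRunConfig, where the script name, compute target and environment are placeholders:

from azureml.core import Experiment, ScriptRunConfig

src = ScriptRunConfig(
    source_directory=".",
    script="process.py",                       # placeholder script
    # as_mount() streams the files at runtime; swap in dataset.as_download()
    # to copy them onto each node's local disk instead.
    arguments=["--data", dataset.as_mount()],
    compute_target="cpu-cluster",              # placeholder compute target
    environment=env,                           # placeholder Environment object
)
run = Experiment(ws, "parquet-access-mode").submit(src)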

I have tried TabularDataset as well. As Maciej S has mentioned, with a TabularDataset the user doesn't need to decide how data is read from the datastore (i.e. the user doesn't need to choose between mount and download as the access mode). However, with the current implementation of TabularDataset (azureml-core 1.38.0), the compute needs more memory (RAM) than FileDataset.download() does for the same set of parquet files. It looks like the current implementation first reads each individual parquet file into a pandas DataFrame (held in memory/RAM) and then appends them into a single DataFrame (the one returned to the API user). The higher memory requirement is likely due to this "eager" nature of the API.
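
To check that memory behaviour on your own data, peak RSS can be inspected after each load path. A minimal sketch using the standard library, assuming a Linux compute instance and the (datastore, path) tuple target from the first answer; run each variant in a fresh kernel so the peak reflects only that code path:

import resource

from azureml.core import Dataset

def peak_rss_mib():
    # On Linux, ru_maxrss is reported in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Eager path: TabularDataset reads every parquet file into pandas and
# concatenates them into one DataFrame in memory.
df = Dataset.Tabular.from_parquet_files(target).to_pandas_dataframe()
print(f"peak RSS after to_pandas_dataframe(): {peak_rss_mib():.0f} MiB")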

Arnab Biswas