I have created a FileDataset using the Azure ML Python API. The data in question is a set of Parquet files (~10K files, each ~330 KB) residing in Azure Data Lake Storage Gen2, spread across multiple partitions. I then tried to mount the dataset on an AML compute instance. During the mounting process, I observed that each Parquet file was downloaded twice under the /tmp directory of the compute instance, with the following message printed to the console logs:
Downloaded path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<blob_path>/20211203.parquet is different from target path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<container_name>/<blob_path>/20211203.parquet
This log message is printed for every Parquet file that is part of the dataset.
The mounting process is also very slow: 44 minutes for ~10K Parquet files, each ~330 KB in size.
"%%time" command in the Jupyter Lab shows most of the time has been used for IO process?
CPU times: user 4min 22s, sys: 51.5 s, total: 5min 13s
Wall time: 44min 15s
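(For reference, the measurement above comes from timing the mount-and-read cell with the %%time magic, roughly as sketched below; the complete code is at the end of the post.)

%%time
# Rough reconstruction of the timed cell: mount the dataset, then read it
mount_context = dataset.mount(path_to_mount)
mount_context.start()
df = pd.read_parquet(path_to_mount)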
Note: both the Data Lake Storage Gen2 account and the Azure ML compute instance are in the same virtual network.
Here are my questions:
- How can I avoid downloading each Parquet file twice?
- How can I make the mounting process faster? (A sketch of the download() alternative I would rather avoid follows this list.)
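For completeness: I am aware that the files could be materialized locally with FileDataset.download() instead of mounting, as sketched below (the target path is a hypothetical example), but I would like to understand and fix the mount behavior itself.

# Alternative to mounting: download the files once to a local directory.
# "/tmp/dataset_local" is a hypothetical example path.
dataset.download(target_path="/tmp/dataset_local", overwrite=True)
df = pd.read_parquet("/tmp/dataset_local")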
I have gone through this thread, but the discussion there did not reach a conclusion.
The Python code I used is as follows:
import pandas as pd
from azureml.core import Dataset

# Create a FileDataset from the list of blob paths and register it
data = Dataset.File.from_files(path=list_of_blobs, validate=True)
dataset = data.register(workspace=ws, name=dataset_name, create_new_version=create_new_version)

mount_context = None
try:
    # Mount the file stream
    mount_context = dataset.mount(path_to_mount)
    mount_context.start()
except Exception:
    raise

df = pd.read_parquet(path_to_mount)
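For completeness, the mount can be released afterwards:

# Release the mount point when the data has been read
mount_context.stop()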