Loading millions of small files from Azure Data Lake Store to Data Bricks

Question

I've got a partitioned folder structure in the Azure Data Lake Store containing roughly 6 million json files (size couple of kb's to 2 mb). I'm trying to extract some fields from these files using Python code in Data Bricks.

Currently I'm trying the following:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx")
spark.conf.set("dfs.adls.oauth2.credential", "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx/oauth2/token")

df = spark.read.json("adl://xxxxxxx.azuredatalakestore.net/staging/filetype/category/2017/*/")

This example even reads only a part of the files since it points to "staging/filetype/category/2017/". It seems to work and there are some jobs starting when I run these commands. It's just very slow.

Job 40 indexes all of the subfolders and is relatively fast

Job 41 checks a set of the files and seems a bit to fast to be true

Then comes job 42, and that's where the slowness starts. It seems to do the same activities as job 41, just... slow

I have a feeling that I have a similar problem to this thread. But the speed of job 41 makes me doubtful. Are there faster ways to do this?

I have had some issues/performance problems using wildcards in spark routes. I try always to avoid them, by previously expanding the needed routes and using ``spark.read.json("adl://xxxxxxx.azuredatalakestore.net/staging/filetype/category/2017/whatever/*.json")``. In conclusion, use wildcards to read a set of files in a leaf node. Take into account that spark internally calls ``list()`` operation recursively and that is quite expensive. — Tomás Denis Reyes Sánchez, Apr 11 '20 at 12:07

score 1 · Answer 1 · answered Apr 13 '18 at 20:42

To add to Jason's answer:

We have run some test jobs in Azure Data Lake operating on about 1.7m files with U-SQL and were able to complete the processing in about 20 hours with 10 AUs. The job was generating several thousand extract vertices, so with a larger number of AUs, it could have finished in a fraction of the time.

We have not tested 6m files, but if you are willing to try, please let us know.

In any case, I do concur with Jason's suggestion to reduce the number and make the files larger.

score 1 · Answer 2 · edited Sep 18 '18 at 16:53

1

we combine files at hourly basis using Azure function and that brings down file processing significantly. So, try combining files before you send it to ADB cluster for processing. IF NOT - either you have a very high number of worker nodes and that might increase your cost.

edited Sep 18 '18 at 16:53

Ibo

4,081
6
45
65

answered Sep 18 '18 at 16:30

Naiveforever

11
1

score 0 · Answer 3 · answered Apr 11 '18 at 13:12

0

I think you will need to look at combining the files before processing. Both to increase size and reduce the number of files. The optimal file size is about 250mb. There are a number of ways to do this perhaps the easiest would be to use azure data lake analytics jobs or even use spark to iterate over a subset of the files

answered Apr 11 '18 at 13:12

Jason Horner

3,630
3
23
29

Do you have some pointers on how to merge files do it with databricks/spark? Some of the files are just to large to fit into a c# string data type and that makes it hard to do it through a data lake analytics job – Simon Zeinstra Apr 13 '18 at 20:11

Loading millions of small files from Azure Data Lake Store to Data Bricks

3 Answers3