Long Azure Data Factory mapping data flow "file system init duration"

Question

I have an ADF mapping data flow that uses an ADLS Gen2 source with a large number of small files (around 10 kB each). 98% of the flow's time is spent in "file system init duration" in this source. I can't find any documentation on what affects file system init or how to improve it. Can anyone point me to some documentation on this?

Thanks! Ed

Can you try it with the Allow Schema Drift option turned off in the source and see if it improves the init duration for you? — Mark Kromer MSFT, May 23 '22 at 22:49
I too have a data flow with this issue - even when I use the "sampling" or "filter by last modified" options, it takes HOURS to read a hundred thousand files to find the few required to process. Inline source, JSON, no schema drift option selectable. — Todd McDermid, Oct 25 '22 at 15:46
Yeah, prefix filtering is significantly faster than last modified filtering. I ended up making this reasonably performant by organizing the files into paths that include the datetime, for example /year/month/day/hour/minute/second as numeric values. Once you have filtered down to a small subset of files, you can apply last modified filtering or other options. — ebclark, Oct 26 '22 at 16:06
We also have this problem. It seems a bit ridiculous for it to spend all its time figuring out the file system. Compared to "on premises", where figuring out the file system happens in a few seconds, our process takes 35 minutes! Ours is using the Common Data Model on Data Lake, which you would think would be optimized. — blobbles, Dec 01 '22 at 19:34
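
The datetime-based folder layout ebclark describes above could be sketched roughly as follows. This is only an illustration, not anything from ADF itself: the function names (partitioned_path, prefix_for), the "landing" root folder, and the zero-padded year/month/day/hour/minute/second levels are all assumptions made for the example.

```python
from datetime import datetime

# Assumed layout: <root>/<year>/<month>/<day>/<hour>/<minute>/<second>/<file>
# Zero-padded numeric levels are an assumption; the comment only says "numeric values".
LEVELS = ["%Y", "%m", "%d", "%H", "%M", "%S"]


def partitioned_path(root: str, ts: datetime, file_name: str) -> str:
    """Build the full path under which a file stamped `ts` would be written."""
    parts = [ts.strftime(fmt) for fmt in LEVELS]
    return "/".join([root.strip("/"), *parts, file_name])


def prefix_for(root: str, ts: datetime, depth: int) -> str:
    """Build a folder prefix covering the first `depth` levels (1 = year, 6 = second)."""
    if not 1 <= depth <= len(LEVELS):
        raise ValueError("depth must be between 1 and 6")
    parts = [ts.strftime(fmt) for fmt in LEVELS[:depth]]
    return "/".join([root.strip("/"), *parts]) + "/"


ts = datetime(2022, 10, 26, 16, 6, 5)
full_path = partitioned_path("landing", ts, "event.json")
hour_prefix = prefix_for("landing", ts, 4)  # everything written in that hour
```

A prefix like hour_prefix would then go wherever the source's folder or wildcard path is configured, so only that subtree has to be listed before any last modified filtering is applied.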

0 Answers