
I have an ADF mapping data flow that uses an ADLS Gen2 source containing a large number of small files (about 10 kB each). 98% of the flow's run time is spent in "file system init duration" for this source. I can't find any documentation on what affects file system init or how to improve it. Can anyone point me to some documentation on this?

Thanks! Ed

ebclark
  • Can you try it with the Allow Schema Drift option turned off in the source and see if it improves the init duration for you? – Mark Kromer MSFT May 23 '22 at 22:49
  • I too have a data flow with this issue - even when I use the "sampling" or "filter by last modified" options, it takes HOURS to read a hundred thousand files to find the few required to process. Inline source, JSON, no schema drift option selectable. – Todd McDermid Oct 25 '22 at 15:46
  • Yea, the prefix filtering is significantly faster than last modified filtering. I ended up making this reasonably performant by organizing the files into paths that include the datetime, say /year/month/day/hour/minute/second numeric values. Once you have filtered down to a small subset of files you can apply last modified filtering or other options. – ebclark Oct 26 '22 at 16:06
  • We also have this problem. It seems a bit ridiculous for the flow to spend all its time figuring out the file system. Compare that with on-premises, where figuring out the file system takes a few seconds; our process takes 35 minutes! Ours uses the Common Data Model on Data Lake, which you would think would be optimized. – blobbles Dec 01 '22 at 19:34
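
The datetime-partitioned layout ebclark describes in the comments can be sketched roughly as follows. This is only an illustration in Python, not ADF code: the `landing` root folder, the function names, and the zero-padded path components are all assumptions made for the example. The idea is to write each file under a /year/month/day/hour/minute/second path, then compute the short list of hour-level prefixes that cover the time window you care about, so the source only has to enumerate those folders.

```python
from datetime import datetime, timedelta


def partition_path(ts: datetime, root: str = "landing") -> str:
    """Build a folder path of the form root/year/month/day/hour/minute/second.

    The root name and zero padding are assumptions for this sketch.
    """
    return f"{root}/{ts:%Y/%m/%d/%H/%M/%S}"


def hour_prefixes(start: datetime, end: datetime, root: str = "landing") -> list[str]:
    """List the hour-level folder prefixes that cover the window [start, end]."""
    # Round the start down to the top of its hour, then step one hour at a time.
    cursor = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while cursor <= end:
        prefixes.append(f"{root}/{cursor:%Y/%m/%d/%H}/")
        cursor += timedelta(hours=1)
    return prefixes


if __name__ == "__main__":
    # Where a file that arrived at this instant would be written.
    print(partition_path(datetime(2022, 10, 26, 16, 6, 5)))
    # Prefixes to read for a window that crosses midnight.
    for prefix in hour_prefixes(datetime(2022, 10, 26, 22, 30),
                                datetime(2022, 10, 27, 0, 15)):
        print(prefix)
```

Prefixes like these could then be supplied to the source as a folder or wildcard path, with last modified filtering applied only to the small subset of files that remains.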

0 Answers