
I'm working in Azure Databricks, stream-reading files from an Azure Data Lake container. I am using the Auto Loader functionality with .format("cloudFiles").

The file structure in the container looks like this:

  • ../accounts/*.csv
  • ../accounts/snapshot/*.csv

Both folders (accounts and snapshot) contain lots of CSV files, and although I only specify "../accounts" as the loading path, anything that is in snapshot is also getting ingested by Auto Loader.

My problem is: I want Auto Loader to ignore any subfolder, but I have no clue how that can be achieved.

Due to technical restrictions, this snapshot folder cannot be moved: it is produced by the Microsoft Dynamics 365 data-export functionality, which creates its own file structure and doesn't allow much configuration.

Thorough testing confirmed that the subfolder is being picked up, but we have no idea why. Is this the default behaviour of Auto Loader?

The options I am using for this:

  • ("cloudFiles.useNotifications", True) -> this is an implementation requirement. Could this be the problem?
  • ("recursiveFileLookup", False) -> the default value is already false, but the existence of this option suggests that reading subfolders is not Auto Loader's default behaviour.
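For context, a minimal sketch of the stream described above. This is an assumption-laden illustration, not the asker's actual code: the storage URL, load path, and `csv_schema` are hypothetical placeholders, and a running Spark session (`spark`) is assumed.

```python
# Sketch only: assumes an active Spark session and a predefined `csv_schema`.
# The abfss:// URL and folder name are hypothetical placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", True)   # implementation requirement
    .option("recursiveFileLookup", False)          # already the default
    .schema(csv_schema)
    .load("abfss://container@storageaccount.dfs.core.windows.net/accounts/")
)
```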

As a workaround I can call .option("pathGlobFilter", ?????.csv), which then filters for the files in accounts whose names differ from those in snapshot, but there must be something else that resolves this.
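To illustrate how such a filter behaves: `pathGlobFilter` matches against the file name with glob semantics, so it can only separate the two folders if the snapshot files are named differently. A rough, runnable illustration using Python's `fnmatch` (the file names and the `account_????.csv` pattern are hypothetical, assuming snapshot files carry an extra suffix):

```python
from fnmatch import fnmatch

# Hypothetical listing mirroring the question's layout.
files = [
    "accounts/account_2022.csv",
    "accounts/snapshot/account_2022_001.csv",
]

# The pattern is applied to the file name only, not the folder path,
# so it excludes snapshot files purely because their names are longer.
pattern = "account_????.csv"
matched = [f for f in files if fnmatch(f.rsplit("/", 1)[-1], pattern)]
print(matched)  # -> ['accounts/account_2022.csv']
```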

  • Did you manage to find a fix for this? – arcamax Nov 30 '22 at 18:50
  • Short answer: no. At first we called drop_duplicates() to make sure we didn't keep the stuff from snapshots, but then we realized we were looking at this the wrong way. We started reading data only from the snapshot folder: the latest file generated there holds the latest version of the file that keeps changing in the root folder. We avoided the mutable-file reading problem (which can occur with Auto Loader) and we are no longer loading dupes – Milan Szabo Dec 01 '22 at 19:10
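For readers, the approach described in the comment above can be sketched as follows. The URL, path, and `csv_schema` are hypothetical placeholders and a Spark session is assumed; this is not the commenter's actual code.

```python
# Sketch only: point Auto Loader at the snapshot subfolder, whose files
# are written once and not mutated, instead of the changing root files.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .schema(csv_schema)
    .load("abfss://container@storageaccount.dfs.core.windows.net/accounts/snapshot/")
)
```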

0 Answers