I'm working in Azure Databricks, stream-reading files from an Azure Data Lake container using the Auto Loader functionality with .format("cloudFiles").
The file structure in the container looks like this:
- ../accounts/*.csv
- ../accounts/snapshot/*.csv
Both folders (accounts and snapshot) contain lots of CSV files, and although I'm only specifying "../accounts" as the loading path, anything that is in snapshot also gets ingested by the Auto Loader.
My problem is: I want the Auto Loader to ignore any subfolders, but I have no clue how that can be achieved.
Due to technical restrictions this snapshot folder cannot be moved, as it is produced by the MS Dynamics 365 data export functionality; essentially it creates its own file structure and doesn't allow for much configuration.
Thorough testing confirmed that the subfolder is getting picked up, but we have no idea why. Is this the default behaviour of the Auto Loader?
The options I am using for this (the full readStream setup is sketched after the list):
- ("cloudFiles.useNotifications", True) -> this is an implementational requirement. Could this be the problem?
- ("recursiveFileLookup", False) -> The default value is false, but the existence of this option suggests that reading subfolders is not the default behaviour of autoloader.
As a workaround I could call .option("pathGlobFilter", ?????.csv), which would then filter for the file names in accounts that differ from those in snapshot, but there has to be something cleaner that resolves this.
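
This is the kind of workaround I have in mind; the glob pattern shown is only a placeholder, relying on the account files having a naming pattern the snapshot files don't:

```python
# Possible workaround: filter by file name; the pattern below is a placeholder.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", True)
    .option("pathGlobFilter", "<accounts-only-pattern>.csv")  # hypothetical pattern
    .load("abfss://<container>@<storage>.dfs.core.windows.net/accounts")
)
```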