Right now, Databricks Auto Loader requires a directory path from which all the files will be loaded. But if other kinds of files, such as log files, also start arriving in that directory, is there a way to tell Auto Loader to exclude those files while preparing the dataframe?

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .schema(<schema>) \
  .load(<input-path>)
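
For concreteness, a filled-in version might look like this (the format, path, and schema below are illustrative placeholders, not values from the question):

from pyspark.sql.types import StructType, StructField, StringType

# assumed schema for the incoming files
schema = StructType([StructField("message", StringType())])

df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .schema(schema) \
  .load("/mnt/landing/events/")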
Keshav Agrawal

2 Answers


Auto Loader supports specifying a glob string as <input-path>; from the documentation:

<input-path> can contain file glob patterns

Glob syntax supports several wildcards, such as * for any sequence of characters. So you can specify <input-path> as, for example, path/*.json. You can exclude files as well; building an exclusion pattern is somewhat more complicated than an inclusion pattern, but it is still possible. For example, *.[^l][^o][^g] should exclude files with the .log extension.
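
As a minimal sketch (directory, format, and schema are assumed placeholders), both patterns are passed straight to load():

# inclusion: pick up only .json files from the monitored directory
df_json = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .schema(schema) \
  .load("/mnt/landing/events/*.json")

# exclusion: skip files ending in .log; note that [^l][^o][^g] negates
# each character position separately, so it is an approximation rather
# than an exact "not .log" match
df_clean = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .schema(schema) \
  .load("/mnt/landing/events/*.[^l][^o][^g]")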

Alex Ott
  • Alex, this was really helpful. I thought this wouldn't work with dynamically generated nested subfolders, but it worked for those as well. This solved the issue for me! – Keshav Agrawal Jul 16 '21 at 09:08

Use pathGlobFilter as one of the options and provide a glob pattern to filter for a file type or for files with a specific name.

For instance, to skip files named A1.csv, A2.csv, ..., A9.csv in the load location, the value for pathGlobFilter would look like:

df = spark.read.load("/file/load/location",
                     format="csv",
                     schema=schema,
                     pathGlobFilter="A[0-9].csv")
Aman Sehgal
  • From the docs: "`pathGlobFilter` is used to only **include** files with file names matching the pattern", so this won't skip these files – Alex Ott Dec 07 '21 at 16:25
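
In other words, pathGlobFilter acts as an inclusion filter. A hedged sketch of its actual effect (path and schema are assumed placeholders): the read below keeps only files named A1.csv through A9.csv and ignores everything else in the location.

# pathGlobFilter keeps only files whose names match the glob;
# all other files in the directory are ignored, not the other way around
df = spark.read.format("csv") \
  .schema(schema) \
  .option("pathGlobFilter", "A[0-9].csv") \
  .load("/file/load/location")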