I'm using wholeTextFiles to read a bunch of XML files from different folders, and some of these folders might be empty. Unfortunately, Spark throws an exception if any of them are empty:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern file:/path/*/*/*.xml matches 0 files
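For context, this is roughly the call that fails; the SparkContext setup, app name, and master are just placeholders, and the glob is the same one that appears in the exception:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal setup; the app name and master are placeholders
val conf = new SparkConf().setAppName("readXml").setMaster("local[*]")
val sc = new SparkContext(conf)

// Each (path, content) pair should be one whole XML file, but the job
// dies with InvalidInputException as soon as an action runs if the glob
// ends up matching no files
val xmlFiles = sc.wholeTextFiles("file:/path/*/*/*.xml")
println(xmlFiles.count())
```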
I've seen some ways of working around this issue when dealing with regular RDDs, like this one, but I couldn't find anything similar for wholeTextFiles.
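The closest thing I can think of is expanding the glob myself with Hadoop's FileSystem API and only passing Spark the paths that actually exist. This is an untested sketch (it reuses sc from the snippet above, and the "no files" message is just a placeholder), so I'm not sure it's the right approach:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Expand the glob up front so empty folders never reach wholeTextFiles
val pattern = new Path("file:/path/*/*/*.xml")
val fs: FileSystem = pattern.getFileSystem(sc.hadoopConfiguration)

// globStatus returns null if nothing along the pattern exists, or an
// empty array if the folders are there but contain no matching files
val matched = fs.globStatus(pattern)

if (matched != null && matched.nonEmpty) {
  // wholeTextFiles accepts a comma-separated list of input paths
  val xmlFiles = sc.wholeTextFiles(matched.map(_.getPath.toString).mkString(","))
  println(xmlFiles.count())
} else {
  println("glob matched no XML files; nothing to read")
}
```

That would avoid the exception, but it feels clunky compared to wholeTextFiles simply tolerating empty matches.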
I've also looked a bit into the Spark code, and this method uses a bunch of private classes, so it seems hard to change the behaviour. Any ideas?