
I want to read Azure Blob Storage files into Spark using Databricks, but I do not want to specify a file, or a *, for each level of nesting.

The standard **/* glob is not working. These work just fine:

val df = spark.read.format("avro").load("dbfs:/mnt/foo/my_file/0/2019/08/24/07/54/10.avro")
val df = spark.read.format("avro").load("dbfs:/mnt/foo/my_file/*/*/*/*/*/*")

whereas

val df = spark.read.format("avro").load("dbfs:/foo/my_file/test/**/*")

fails with:

java.io.FileNotFoundException: No Avro files found. If files don't have .avro extension, set ignoreExtension to true
Georg Heiler

1 Answer

Spark by default reads recursively down the directory tree, so you only need to point at the root folder:

val df = spark.read.format("avro").load("dbfs:/foo/my_file/test/")
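
On newer runtimes (Spark 3.0+, i.e. Databricks Runtime 7.x and later) there is also an explicit recursiveFileLookup option that disables partition discovery and picks up files at any depth. A minimal sketch, assuming such a runtime:

// Minimal sketch, assuming Spark 3.0+ (Databricks Runtime 7.x or later).
// recursiveFileLookup disables partition discovery and reads files at any depth.
val df = spark.read
  .format("avro")
  .option("recursiveFileLookup", "true")
  .load("dbfs:/foo/my_file/test/")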

The path value is actually treated as a glob pattern, not a regex.

** does nothing: it has no special meaning in Hadoop globs.

* will work, and can be combined with alternation of the form {a,b}; this is known as globbing. This is worth a read: How to use regex to include/exclude some input files in sc.textFile?
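
For illustration, a sketch of glob syntax, assuming the same 0/year/month/day/hour/minute directory layout as in the question:

// Sketch, assuming the 0/year/month/day/hour/minute layout from the question.
// {a,b} is alternation and [0-9] a character range; ** has no special meaning.
val df = spark.read.format("avro").load("dbfs:/mnt/foo/my_file/0/2019/{07,08}/*/*/*")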

simon_dmorias