
I want to read Azure Blob Storage files into Spark using Databricks, but I do not want to specify a file, or a *, for each level of nesting.

The standard **/* glob is not working. These work just fine:

val df = spark.read.format("avro").load("dbfs:/mnt/foo/my_file/0/2019/08/24/07/54/10.avro")
val df = spark.read.format("avro").load("dbfs:/mnt/foo/my_file/*/*/*/*/*/*")

whereas

val df = spark.read.format("avro").load("dbfs:/foo/my_file/test/**/*")

fails with:

java.io.FileNotFoundException: No Avro files found. If files don't have .avro extension, set ignoreExtension to true
Georg Heiler

1 Answer

Spark by default reads recursively down the directory tree, so you only need to point at the root folder:

val df = spark.read.format("avro").load("dbfs:/foo/my_file/test/")
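
On newer runtimes (Spark 3.0+, i.e. Databricks Runtime 7.x and later) there is also an explicit recursiveFileLookup option that disables partition discovery and picks up files at any depth. A minimal sketch, assuming such a runtime:

// Minimal sketch, assuming Spark 3.0+ (Databricks Runtime 7.x or later).
// recursiveFileLookup disables partition discovery and reads files at any depth.
val df = spark.read
  .format("avro")
  .option("recursiveFileLookup", "true")
  .load("dbfs:/foo/my_file/test/")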

The path value is actually treated as a glob pattern, not a regex.

** does nothing: it has no special meaning in Hadoop globs.

* will work, and can be combined with alternation of the form {a,b}; this is known as globbing. This is worth a read: How to use regex to include/exclude some input files in sc.textFile?
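
For illustration, a sketch of glob syntax, assuming the same 0/year/month/day/hour/minute directory layout as in the question:

// Sketch, assuming the 0/year/month/day/hour/minute layout from the question.
// {a,b} is alternation and [0-9] a character range; ** has no special meaning.
val df = spark.read.format("avro").load("dbfs:/mnt/foo/my_file/0/2019/{07,08}/*/*/*")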

simon_dmorias