I'm trying to load a directory of Parquet files in Spark but can't get it to work. This works:

val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=20151102")

but this doesn't work:

val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*")

It fails with this error:

java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*

How do I get it to work with a wildcard?

Jacek Laskowski
lightweight
  • You can use one of the solutions in http://stackoverflow.com/questions/794381/how-to-find-files-that-match-a-wildcard-string-in-java to turn the wildcard into a list of filenames that exist on your system. – Hellmar Becker Nov 21 '15 at 15:32
  • What version of Spark? This is supposed to be fixed. – Marius Soutier Nov 21 '15 at 17:27

2 Answers

You can list the files or folders with the Hadoop FileSystem listStatus call, then read each one you want. Use a reduce with unionAll to combine the results into one DataFrame (each read returns a DataFrame, not an RDD).

Get the files/folders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))

Read in the data:

val parquetFiles = status.map { folder =>
  sqlContext.read.parquet(folder.getPath.toString)
}

Merge the data into a single DataFrame:

val mergedFile = parquetFiles.reduce((x, y) => x.unionAll(y))

You can also have a look at my past posts around the same topic.

Spark Scala list folders in directory

Spark/Scala flatten and flatMap is not working on DataFrame
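
As a variation on the listing approach above, Hadoop's FileSystem.globStatus can expand the wildcard itself, so only the matching partition directories are read. The following is a minimal sketch, not a definitive implementation: the object name GlobLoad and the helpers expandGlob and loadParquetGlob are hypothetical names made up for illustration, and it assumes a Spark version where sqlContext.read.parquet accepts several paths (1.4 or later). On a real cluster you would probably pass the Hadoop configuration from your SparkContext rather than a fresh Configuration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper object, named here only for illustration.
object GlobLoad {

  // Expand a glob pattern into the concrete paths that exist on the file system.
  // Assumption: the default Configuration is enough to resolve the path's scheme.
  def expandGlob(pattern: String, conf: Configuration = new Configuration()): Seq[String] = {
    val path = new Path(pattern)
    val fs = path.getFileSystem(conf)
    // globStatus can return null when nothing matches, so wrap it in Option.
    Option(fs.globStatus(path))
      .map(_.toSeq)
      .getOrElse(Seq.empty)
      .map(_.getPath.toString)
  }

  // Read every path matching the pattern into one DataFrame.
  def loadParquetGlob(sqlContext: SQLContext, pattern: String): DataFrame = {
    val paths = expandGlob(pattern)
    require(paths.nonEmpty, s"No paths match $pattern")
    // The varargs overload reads all paths at once, so no explicit union is needed.
    sqlContext.read.parquet(paths: _*)
  }
}
```

Usage would look like GlobLoad.loadParquetGlob(sqlContext, "hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*"). Whether the loaddate partition column shows up in the result depends on the Spark version and on partition discovery settings such as basePath.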

AlexL
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.

For example (note the "/" between the base path and the partition pattern):

val basePath = "hdfs://nameservice1/data/rtl/events/stream"

sparkSession.read.option("basePath", basePath).parquet(basePath + "/loaddate=201511*")
Stephen Rauch
bruse