How to use sqlContext to load multiple parquet files?

Question

I'm trying to load a directory of Parquet files in Spark but can't get it to work. Loading a single partition directory works:

val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=20151102")

but this doesn't work:

val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*")

It fails with this error:

java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*

How do I get it to work with a wildcard?

You can use one of the solutions in http://stackoverflow.com/questions/794381/how-to-find-files-that-match-a-wildcard-string-in-java to turn the wildcard into a list of filenames that exist on your system. — Hellmar Becker, Nov 21 '15 at 15:32

score 7 · Answer 1 · edited May 23 '17 at 12:18

You can list the files or folders with the FileSystem listStatus call, read each one you want, and then use a reduce with unionAll to combine the results into one DataFrame (sqlContext.read.parquet returns a DataFrame, not an RDD).

Get the files/folders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))

Read in the data:

val parquetFiles = status.map { folder =>
  sqlContext.read.parquet(folder.getPath.toString)
}

Merge the data into a single DataFrame:

val mergedFile = parquetFiles.reduce((x, y) => x.unionAll(y))
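As a variation on the steps above, the wildcard can be expanded by Hadoop itself and the matching paths handed to the reader in one call, which avoids the reduce/unionAll step. This is only a sketch: `loadMatching` is a hypothetical helper name, and it assumes a Spark version where `read.parquet` accepts several paths.

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper (not part of Spark): expand a glob pattern with the
// Hadoop FileSystem API, then load every matching path in a single read.
def loadMatching(sqlContext: SQLContext, pattern: String): DataFrame = {
  val globPath = new Path(pattern)
  // Resolve the file system from the path's scheme, reusing Spark's Hadoop config.
  val fs = globPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
  // globStatus may return null (missing non-glob path) or an empty array (no match).
  val matches = Option(fs.globStatus(globPath)).getOrElse(Array.empty[FileStatus])
  val paths = matches.map(_.getPath.toString)
  require(paths.nonEmpty, s"No paths match $pattern")
  // read.parquet takes a variable number of paths.
  sqlContext.read.parquet(paths: _*)
}
```

With the path from the question, this would be called as `loadMatching(sqlContext, "hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*")`.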

You can also have a look at my past posts around the same topic.

Spark Scala list folders in directory

Spark/Scala flatten and flatMap is not working on DataFrame

score 2 · Answer 2 · edited Dec 19 '17 at 03:34


If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.

For example (note the "/" between the base path and the partition pattern):

val basePath = "hdfs://nameservice1/data/rtl/events/stream"

sparkSession.read.option("basePath", basePath).parquet(basePath + "/loaddate=201511*")
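A slightly fuller sketch of the same idea, wrapped in a function so the base path and the partition pattern stay separate. `loadPartitions` and its parameter names are hypothetical; the only Spark features assumed are the "basePath" read option and glob support in the path passed to `parquet`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: read only the partition directories matching
// `partitionGlob`, while telling Spark where the table root is.
def loadPartitions(spark: SparkSession, basePath: String, partitionGlob: String): DataFrame =
  spark.read
    .option("basePath", basePath)          // root directory of the partitioned table
    .parquet(s"$basePath/$partitionGlob")  // e.g. "<root>/loaddate=201511*"
```

Because the root is given as basePath, partition discovery starts there and the resulting DataFrame should include `loaddate` as a column, so a call such as `loadPartitions(sparkSession, basePath, "loaddate=201511*").select("loaddate").distinct()` can be used to check which partitions were picked up.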

answered Dec 19 '17 at 03:17 by bruse · edited by Stephen Rauch
