How can I read multiple Parquet files in Spark with Scala?

Question

The folders below keep being updated over time, and each contains multiple .parquet files. How can I read them into a Spark DataFrame in Scala?

"id=200393/date=2019-03-25"
"id=200393/date=2019-03-26"
"id=200393/date=2019-03-27"
"id=200393/date=2019-03-28"
"id=200393/date=2019-03-29" and so on ...

Note: there could be 100 date folders, and I need to pick only specific ones (say, the 25th, 26th and 28th).

Is there a better way than the code below?

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._

val spark = SparkSession.builder.appName("ScalaCodeTest").master("yarn").getOrCreate()
val parquetFiles = List("id=200393/date=2019-03-25", "id=200393/date=2019-03-26", "id=200393/date=2019-03-28")

spark.read.format("parquet").load(parquetFiles: _*)

The code above works, but I would like to do something like the following:

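import scala.collection.mutable.ListBuffer

// List is immutable, so a mutable ListBuffer is used to add paths one by one
val parquetFiles = ListBuffer[String]()
parquetFiles += "id=200393/date=2019-03-25"
parquetFiles += "id=200393/date=2019-03-26"
parquetFiles += "id=200393/date=2019-03-28"
spark.read.format("parquet").load(parquetFiles.toList: _*)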

This can be similar to - https://stackoverflow.com/questions/33650421/reading-dataframe-from-partitioned-parquet-file — sangam.gavini, Oct 04 '19 at 17:43

firsni · Answer 1 · 2019-10-07T10:18:29.550

7

You can read all folders under the id=200393 directory like this:

val df  = spark.read.parquet("id=200393/*")

If you want to select only some dates, for example only September 2019:

val df = spark.read.parquet("id=200393/date=2019-09-*")

If you only need specific days, keep them in a list and build the paths from it:

val days = List("2019-09-02", "2019-09-03")
// Folders are named date=<day>, so the prefix must be part of each path
val paths = days.map(day => "id=200393/date=" + day)
val df = spark.read.parquet(paths: _*)
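
A minimal sketch of how the list of paths could be built for a hand-picked set of dates, assuming the id=<id>/date=<yyyy-MM-dd> layout from the question. The helper name partitionPaths is hypothetical and not part of Spark; each date is parsed with java.time.LocalDate so that a malformed entry fails early instead of producing a path that does not exist.

```scala
import java.time.LocalDate

// Hypothetical helper (not part of Spark): builds one folder path per date,
// following the id=<id>/date=<yyyy-MM-dd> layout shown in the question.
def partitionPaths(id: Long, dates: Seq[String]): Seq[String] =
  dates.map { d =>
    // LocalDate.parse throws DateTimeParseException on malformed input
    val day = LocalDate.parse(d)
    s"id=$id/date=$day"
  }

val paths = partitionPaths(200393L, Seq("2019-03-25", "2019-03-26", "2019-03-28"))
// The result can then be passed to the reader: spark.read.parquet(paths: _*)
```

Keeping the dates in a plain sequence means the list can come from anywhere (a config file, a command-line argument) without touching the read call.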

edited Oct 07 '19 at 10:18

answered Oct 04 '19 at 17:48

firsni


As I mentioned, the dates might keep changing. Maybe I should have been more specific: there could be 100 date folders and I need only (say) 3 of them. In your solution, the asterisk (*) will include all 100 dates. – Mradula Ghatiya Oct 07 '19 at 09:17
Here in the code snippet I select all dates with id = 200393. Do you want to select only some? – firsni Oct 07 '19 at 09:18
@Mradula, does my answer address your question? – firsni Oct 08 '19 at 10:01
Yes, I was looking for the solution you give as your third option. This is a better way of loading data for specific dates. Thanks – Mradula Ghatiya Oct 09 '19 at 14:47
Glad I could help. :) Could you mark the answer as accepted? Thanks – firsni Oct 09 '19 at 18:17
I tried, but because I am a new user here, there are some reputation constraints that prevent my feedback from showing up when I mark the answer as correct. – Mradula Ghatiya Oct 15 '19 at 11:37

score 0 · Answer 2 · answered Oct 04 '19 at 21:34

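If you want to keep the partition columns, you could try the following. With basePath set to id=200393/ the result keeps 'date'; to keep 'id' as well, basePath has to be the directory that contains id=200393: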

val df = sqlContext
     .read
     .option("basePath", "id=200393/")
     .parquet("id=200393/date=*")
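
To pick only a few dates while keeping both partition columns, one option is to read from the directory that contains the id=... folders and filter on the partition columns. The sketch below assumes that layout; the helper name readDates and the placeholder path are hypothetical. Spark normally prunes partitions when a filter refers only to partition columns, so only the matching folders should be scanned, although listing a very large number of folders can still take time.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper: dataRoot is assumed to be the directory that
// contains the id=... folders, so id and date both become columns.
def readDates(spark: SparkSession, dataRoot: String, id: Int, dates: Seq[String]): DataFrame =
  spark.read
    .parquet(dataRoot)
    // date is usually inferred as a date type from the folder name,
    // so cast it to string before comparing with the wanted values
    .where(col("id") === id && col("date").cast("string").isin(dates: _*))

// Example call, with a placeholder path:
// val df = readDates(spark, "/path/to/data", 200393, Seq("2019-03-25", "2019-03-26", "2019-03-28"))
```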

answered Oct 04 '19 at 21:34

seiya


As I mentioned, the dates might keep changing. Maybe I should have been more specific: there could be 100 date folders and I need only (say) 3 of them. In your solution, the asterisk (*) will include all 100 dates. – Mradula Ghatiya Oct 07 '19 at 09:19
