Below are some folders, which may keep being updated over time. Each contains multiple .parquet files. How can I read them into a Spark DataFrame in Scala?

  • "id=200393/date=2019-03-25"
  • "id=200393/date=2019-03-26"
  • "id=200393/date=2019-03-27"
  • "id=200393/date=2019-03-28"
  • "id=200393/date=2019-03-29" and so on ...

Note: There could be 100 date folders, and I need to pick only specific ones (let's say the 25th, 26th and 28th).

Is there a better way than the one below?

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._

val spark = SparkSession.builder.appName("ScalaCodeTest").master("yarn").getOrCreate()
val parquetFiles = List("id=200393/date=2019-03-25", "id=200393/date=2019-03-26", "id=200393/date=2019-03-28")

spark.read.format("parquet").load(parquetFiles: _*)

The above code works, but I want to build the list step by step, something like this:

val parquetFiles = List()
parquetFiles(0) = "id=200393/date=2019-03-25"
parquetFiles(1) = "id=200393/date=2019-03-26"
parquetFiles(2) = "id=200393/date=2019-03-28"
spark.read.format("parquet").load(parquetFiles: _*)

2 Answers

7

You can read all folders under the directory id=200393 this way:

val df = spark.read.parquet("id=200393/*")

If you want to select only some dates, for example only September 2019 (note that the folder names start with "date="):

val df = spark.read.parquet("id=200393/date=2019-09-*")

If you need specific days, keep them in a list and build the paths from it:

  val days = List("2019-09-02", "2019-09-03")
  // Folder names follow the pattern id=<id>/date=<day>
  val paths = days.map(day => "id=200393/date=" + day)
  val df = spark.read.parquet(paths: _*)
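
The list-based approach above can also be built up incrementally, which is what the question asks for. Here is a minimal sketch in plain Scala, assuming the folder layout from the question (id=<id>/date=<date>). The names PartitionPaths, pathsFor and globFor are hypothetical helpers, not part of Spark. It uses a mutable ListBuffer because an immutable List cannot be assigned by index, and it also shows a Hadoop-style glob with {a,b,c} alternation as a single-path alternative.

```scala
import scala.collection.mutable.ListBuffer

object PartitionPaths {
  // Build one path per requested date, following the layout id=<id>/date=<date>.
  def pathsFor(id: String, dates: Seq[String]): List[String] =
    dates.map(d => s"id=$id/date=$d").toList

  // Alternative: a single Hadoop-style glob using {a,b,c} alternation.
  def globFor(id: String, dates: Seq[String]): String =
    dates.mkString(s"id=$id/date={", ",", "}")

  def main(args: Array[String]): Unit = {
    // Dates can be appended one at a time as they become known.
    val dates = ListBuffer[String]()
    dates += "2019-03-25"
    dates += "2019-03-26"
    dates += "2019-03-28"

    val paths = pathsFor("200393", dates.toList)
    paths.foreach(println)
    println(globFor("200393", dates.toList))

    // With a SparkSession named `spark` in scope, either form can be read:
    //   spark.read.parquet(paths: _*)
    //   spark.read.parquet(globFor("200393", dates.toList))
  }
}
```

Keeping the path construction separate from the read call means the list of dates can come from anywhere (a config file, command-line arguments) without touching the Spark code.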
firsni
  • As I mentioned, the dates might keep changing. Maybe I should have been more specific: there could be 100 date folders and I need to load only (let's say) 3 of them. In your solution, the asterisk (*) will include all 100 dates. – Mradula Ghatiya Oct 07 '19 at 09:17
  • In the code snippet I select all dates with id=200393. Do you want to select only some? – firsni Oct 07 '19 at 09:18
  • @Mradula does my answer address your question? – firsni Oct 08 '19 at 10:01
  • Yes, I was looking for the solution you gave as your 3rd option. This is a better way of loading data for specific dates. Thanks – Mradula Ghatiya Oct 09 '19 at 14:47
  • Glad I could help. :) If you can, please mark the answer as correct. Thx – firsni Oct 09 '19 at 18:17
  • I tried, but because I am a new user here, there are reputation constraints that keep my feedback from showing up when I mark the answer as correct. – Mradula Ghatiya Oct 15 '19 at 11:37
0

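If you want to keep the partition columns in the result, set the basePath option. Partition discovery only covers key=value directories below basePath, so the value below keeps the 'date' column; to keep 'id' as well, basePath has to point at the parent directory of id=200393: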

val df = sqlContext
     .read
     .option("basePath", "id=200393/")
     .parquet("id=200393/date=*")
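
The effect of basePath can be illustrated with a simplified model in plain Scala. This is not Spark's implementation, only a sketch assuming that every key=value directory below basePath becomes a column. PartitionColumnsSketch and partitionColumns are hypothetical names, and the empty string stands in for the parent directory of id=200393 because the paths in the question are relative.

```scala
object PartitionColumnsSketch {
  // Simplified model: every key=value directory below basePath becomes a column.
  // Assumes dataPath starts with basePath; no path normalization is attempted.
  def partitionColumns(basePath: String, dataPath: String): Map[String, String] = {
    val base = basePath.stripSuffix("/")
    val relative =
      if (base.isEmpty) dataPath
      else dataPath.stripPrefix(base).stripPrefix("/")
    relative.split("/").toList
      .filter(_.contains("="))
      .map { segment =>
        val Array(key, value) = segment.split("=", 2)
        key -> value
      }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val dataPath = "id=200393/date=2019-03-25"
    // basePath inside the id folder: only segments below it are considered.
    println(partitionColumns("id=200393/", dataPath))
    // basePath at the parent directory: both segments are considered.
    println(partitionColumns("", dataPath))
  }
}
```

Under this model, a basePath of "id=200393/" recovers only the date column, while a basePath at the parent directory recovers both id and date.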
seiya
  • As I mentioned, the dates might keep changing. Maybe I should have been more specific: there could be 100 date folders and I need to load only (let's say) 3 of them. In your solution, the asterisk (*) will include all 100 dates. – Mradula Ghatiya Oct 07 '19 at 09:19