
I have data stored in daily folders, with a new folder created every day.

Ex: below is the format of the data folders present in AWS S3 for the whole year (2017), i.e. 365 folders:

student_id=20170415
student_id=20170416
student_id=20170417
student_id=20170418

Each folder has multiple partitions of data in Parquet format.

Now I would like to read only the past 6 months (180 days / 180 folders) of data and perform some logic on a few columns.

How can I read the past 180 folders into a single DataFrame? I don't want to use unions (i.e. I don't want to read each day's folder into a separate DataFrame and union them all later into one giant DataFrame).

I'm using Spark 2.0 and Scala.

shiv455

1 Answer


You can build a glob pattern for the directory name, something like below, and use either SparkSession.read if you want only the content of the files, or sparkContext.wholeTextFiles if you want [K, V] pairs like [filename, record]:

val inputPath = "/root/path/todir/student_id=20170[1-6]*" // glob matching only the first six months' folders (01-06)
import spark.implicits._ // needed for .toDF() on the RDD
spark.sparkContext.wholeTextFiles(inputPath).toDF().show(1000) // gives [K, V] pairs of (filename, file content)
val data = spark.read.parquet(inputPath) // gives only the content, as a single DataFrame
data.show(1000)

Both would result in a single DataFrame. With SparkSession.read the row count will be the total number of records across all matched folders, while with SparkContext.wholeTextFiles it will be one row per file (number of folders * number of files per folder).
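Since the folders follow a Hive-style student_id=YYYYMMDD layout, another option is to point spark.read.parquet at the base directory and filter on the partition column that Spark discovers, which lets Spark prune the folders outside the 180-day window instead of listing them yourself. A minimal sketch, assuming a hypothetical base path s3://bucket/data/ one level above the daily folders, and assuming partition discovery infers student_id as an integer column (if it comes back as a string, compare against the formatted string instead):

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-last-180-days").getOrCreate()
import spark.implicits._

// Partition discovery turns each student_id=YYYYMMDD folder name into a column named student_id.
val df = spark.read.parquet("s3://bucket/data/") // hypothetical base path

// Cutoff date 180 days back, formatted like the folder names (yyyyMMdd) and compared numerically.
val cutoff = LocalDate.now().minusDays(180).format(DateTimeFormatter.ofPattern("yyyyMMdd")).toInt

// Filtering on the partition column prunes the folders outside the window,
// so only the last ~180 daily folders are read, in one DataFrame with no unions.
val last180 = df.filter($"student_id" >= cutoff)
last180.show(10)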

Vignesh I