I have data in folders, one of which is created every day.
For example, below is the format of the data folders present in AWS S3 for the whole year (2017), i.e. 365 folders:
student_id=20170415
student_id=20170416
student_id=20170417
student_id=20170418
Each folder contains multiple partitions of data in Parquet format.
Now I would like to read only the past 6 months (180 days / 180 folders) of data and apply some logic to a few columns.
How can I read the past 180 folders into a single DataFrame? I don't want to use unions, i.e. I don't want to read each day's folder into a separate DataFrame and then union them all into one giant DataFrame.
I'm using Spark 2.0 and Scala.
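To make the goal concrete, here is a sketch of the kind of single-read approach I have in mind: build the list of the 180 daily folder paths with `java.time`, then pass them all to one `spark.read.parquet(...)` call. The bucket name, base path, and end date below are placeholders; only the `student_id=yyyyMMdd` folder naming comes from my actual layout.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object PathBuilder {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")

  // Build the daily folder paths for the `days` days ending at `end` (inclusive),
  // matching the student_id=yyyyMMdd naming of the S3 folders.
  def dailyPaths(base: String, end: LocalDate, days: Int): Seq[String] =
    (0 until days).map(i => s"$base/student_id=${end.minusDays(i.toLong).format(fmt)}")
}

object Main {
  def main(args: Array[String]): Unit = {
    // Placeholder bucket/path and end date -- substitute real values.
    val paths = PathBuilder.dailyPaths("s3a://my-bucket/data", LocalDate.of(2017, 6, 30), 180)
    println(paths.head)  // s3a://my-bucket/data/student_id=20170630

    // With a SparkSession in scope, all 180 folders could then be read
    // into ONE DataFrame in a single call, with no unions:
    // val df = spark.read.parquet(paths: _*)
  }
}
```

As I understand it, `DataFrameReader.parquet` accepts varargs of paths, so this should produce one DataFrame directly, but I'd like to confirm whether this (or something like a base-path read with a filter on the partition column) is the recommended way.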