
I have data in S3 that's being written there with a directory structure as follows: YYYY/MM/DD/HH

I am trying to write a program which will take in a start and end date, aggregate the data between those dates, and then convert it to a Spark RDD to perform operations on.

Is there a clean way to do this aggregation without iterating over every combination of YYYY/MM/DD/HH and without having to walk a tree to find where it's safe to use wildcards?

E.g. if the input is start=2014/01/01/00 and end=2016/02/01/00, do I have to access each prefix individually (i.e. 2014/*/*/*, 2015/*/*/*, 2016/01/*/*)?

1 Answer


Are you using DataFrames? If so, you can leverage the API for loading data from multiple sources. Take a look at this documentation: https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrameReader

def load(paths: String*): DataFrame

The above method accepts multiple source paths, so you can pass every path in your date range in a single call.
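For illustration only, here is a minimal sketch (not part of the original answer) of how the two pieces might fit together: generate one prefix per hour between the start and end timestamps, then hand the whole list to load(paths: String*). The bucket name my-bucket, the json format, and the Spark 1.6-style SQLContext setup are assumptions, not something from the question.

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object S3RangeLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-range-load"))
    val sqlContext = new SQLContext(sc)

    // One prefix per hour between start (inclusive) and end (exclusive),
    // matching the YYYY/MM/DD/HH layout described in the question.
    val fmt   = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
    val start = LocalDateTime.of(2014, 1, 1, 0, 0)
    val end   = LocalDateTime.of(2016, 2, 1, 0, 0)

    val paths = Iterator.iterate(start)(_.plusHours(1))
      .takeWhile(_.isBefore(end))
      .map(ts => s"s3a://my-bucket/${ts.format(fmt)}/") // bucket name is a placeholder
      .toSeq

    // load(paths: String*) takes the whole list in one call; the resulting
    // DataFrame can be turned into an RDD with .rdd for further operations.
    val df  = sqlContext.read.format("json").load(paths: _*)
    val rdd = df.rdd
    println(rdd.count())
  }
}

Note that a two-year range at hourly granularity is on the order of 18,000 paths, so listing them all can slow down job setup; where a whole day or month falls inside the range, collapsing those hours into a single wildcard prefix is a common refinement.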

The discussion in this Stack Overflow thread might also be helpful for you: Reading multiple files from S3 in Spark by date period

Tawkir