I have data in S3 that's being written there with a directory structure as follows: YYYY/MM/DD/HH
I am trying to write a program which will take in a start and end date, aggregate the data between those dates, and then convert it to a Spark RDD to perform operations on.
Is there a clean way to do this aggregation without iterating over every combination of YYYY/MM/DD/HH, and without having to build a tree to figure out where it's safe to use wildcards?
I.e. if the input is start=2014/01/01/00 and end=2016/02/01/00, do I have to access each prefix individually (i.e. 2014/*/*/*, 2015/*/*/*, 2016/01/*/*)?
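For concreteness, here is a minimal sketch of the brute-force hourly enumeration I'm hoping to avoid. It assumes a layout like s3n://my-bucket/YYYY/MM/DD/HH/ (the bucket name and the s3n:// scheme are placeholders; s3a:// may apply depending on the setup), and relies on the fact that sc.textFile accepts a comma-separated list of path globs:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.{SparkConf, SparkContext}

object HourlyAggregate {
  // Formatter matching the YYYY/MM/DD/HH directory layout.
  private val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")

  // Enumerate every hourly prefix between start and end (inclusive).
  def hourlyPaths(bucket: String, start: LocalDateTime, end: LocalDateTime): Seq[String] =
    Iterator.iterate(start)(_.plusHours(1))
      .takeWhile(!_.isAfter(end))
      .map(t => s"$bucket/${t.format(fmt)}/*")
      .toSeq

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregate-hours"))
    val start = LocalDateTime.of(2014, 1, 1, 0, 0)
    val end   = LocalDateTime.of(2016, 2, 1, 0, 0)

    // One textFile call over a comma-separated list of hourly globs;
    // for this range that is roughly 18,000 prefixes, which is what
    // I'd like to avoid generating by hand.
    val rdd = sc.textFile(hourlyPaths("s3n://my-bucket", start, end).mkString(","))
    println(rdd.count())
    sc.stop()
  }
}
```

Is there a better way than building this path list (or a tree of coarser wildcards) myself?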