I've been working quite a lot with Apache Spark over the last few months, and now I've received a pretty difficult task: compute average/minimum/maximum and so on over a sliding window of a paired RDD, where the key component is a date tag and the value component is a matrix. Each aggregation function should therefore also return a matrix, where each cell holds the aggregate of that cell across all matrices in the time period.
For example, I want to be able to ask for the average over every 7 days, with the window sliding by one day. The window always moves by one unit of whatever the window size is expressed in (so for a 12-week window, it slides by one week).
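To make the data shape concrete, here is a minimal sketch of how I represent things, assuming the matrix is a plain `Array[Array[Double]]` and the date tag is just a day index; the names `Matrix`, `addMatrices`, and `divideMatrix` are illustrative helpers, not anything standard:

```scala
// Minimal sketch of the data shape, assuming the matrix is an
// Array[Array[Double]] and the date tag is a day index (Int).
type Matrix = Array[Array[Double]]

// Element-wise sum of two equally sized matrices.
def addMatrices(a: Matrix, b: Matrix): Matrix =
  a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }

// Divide every cell by n, turning an element-wise sum into an average.
def divideMatrix(m: Matrix, n: Double): Matrix =
  m.map(_.map(_ / n))
```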
My initial thought is simply to iterate: if we want an average per X days, iterate X times, and each time group the elements by their date, with a different offset.
So if we have this scenario:
Days: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Matrices: A B C D E F G H I J K L M N O
And we want the average per 5 days, then I iterate 5 times. The groupings look like this:
First iteration:
Group 1: (1, A) (2, B) (3, C) (4, D) (5, E)
Group 2: (6, F) (7, G) (8, H) (9, I) (10, J)
Group 3: (11, K) (12, L) (13, M) (14, N) (15, O)
Second iteration:
Group 1: (2, B) (3, C) (4, D) (5, E) (6, F)
Group 2: (7, G) (8, H) (9, I) (10, J) (11, K)
Group 3: (12, L) (13, M) (14, N) (15, O)
And so on. For each group, I then do a fold/reduce to compute the average, roughly like the sketch below.
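In (rough) Spark code, the whole idea would look something like this sketch, reusing the hypothetical `Matrix` helpers above and assuming day keys start at 1:

```scala
import org.apache.spark.rdd.RDD

// Sketch of the iterate-and-group idea: one pass per offset, bucketing
// each day into a group of `windowSize` consecutive days, then reducing
// each group to an element-wise average.
def groupedAverages(data: RDD[(Int, Matrix)],
                    windowSize: Int): Seq[RDD[(Int, Matrix)]] =
  (0 until windowSize).map { offset =>
    data
      // Days before the shifted start fall out entirely (day 1 has no
      // group in the second iteration, and so on).
      .filter { case (day, _) => day > offset }
      // Bucket: days offset+1 .. offset+windowSize map to group 0, etc.
      .map { case (day, m) => ((day - 1 - offset) / windowSize, (m, 1)) }
      // Fold each group into (element-wise sum, element count).
      .reduceByKey { case ((m1, n1), (m2, n2)) => (addMatrices(m1, m2), n1 + n2) }
      // Turn the per-group sum into a per-cell average.
      .mapValues { case (sum, count) => divideMatrix(sum, count) }
  }
```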
However, as you can imagine, this is pretty slow and probably a rather bad way to do it, but I can't figure out a better one.