I have an RDD where each record is an int:
[0,1,2,3,4,5,6,7,8]
All I need to do is split this RDD into batches, i.e. make another RDD where each element is a fixed-size list of elements:
[[0,1,2], [3,4,5], [6,7,8]]
This sounds trivial; however, I have been puzzling over it for the last several days and cannot find anything except the following solution:
Use zipWithIndex() to enumerate the records in the RDD:
[0,1,2,3,4,5] -> [(0, 0),(1, 1),(2, 2),(3, 3),(4, 4),(5, 5)]
Iterate over this RDD using map() and calculate the batch index like
index = int(index / batchSize)
[(0, 0),(1, 1),(2, 2),(3, 3),(4, 4),(5, 5)] -> [(0, 0),(0, 1),(0, 2),(1, 3),(1, 4),(1, 5)]
Then group by the generated index:
[(0, [0,1,2]), (1, [3,4,5])]
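For reference, a minimal PySpark sketch of the three steps above (the SparkContext setup and batchSize value are illustrative, not part of my actual job):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-batching")  # illustrative setup
batchSize = 3

rdd = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8])

batched = (rdd
    .zipWithIndex()                                      # (element, index)
    .map(lambda pair: (pair[1] // batchSize, pair[0]))   # (batch index, element)
    .groupByKey()                                        # (batch index, iterable of elements)
    .sortByKey()                                         # keep the batches in order
    .map(lambda kv: list(kv[1])))                        # drop the key, keep the batch as a list

print(batched.collect())  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

Note that this relies on groupByKey keeping the values of each key in their original order, which Spark does not strictly guarantee.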
This will get me what I need; however, I do not want to use a group-by here. It is trivial when you are using plain MapReduce or an abstraction like Apache Crunch. But is there a way to produce a similar result in Spark without using a heavy group-by?