In my Spark Scala application, I have an RDD with the following format:
(05/05/2020, (name, 1))
(05/05/2020, (name, 1))
(05/05/2020, (name2, 1))
...
(06/05/2020, (name, 1))
What I want to do is group these elements by date and, within each date, sum the counts of the tuples that share the same "name" key.
Expected Output:
(05/05/2020, List[(name, 2), (name2, 1)]),
(06/05/2020, List[(name, 1)])
...
To do that, I am currently using a groupByKey operation followed by some extra transformations that group the tuples by name and calculate the sum for those that share the same one.
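Simplified, my current code looks roughly like this (assuming rdd has type RDD[(String, (String, Int))], matching the format above):

```scala
// Current approach: groupByKey ships every single (name, 1) tuple across
// the network; the per-name sums are only computed after the shuffle.
val grouped = rdd
  .groupByKey()                                 // (date, Iterable[(name, 1)])
  .mapValues { tuples =>
    tuples
      .groupBy { case (name, _) => name }       // group tuples by name within each date
      .map { case (name, ts) => (name, ts.map(_._2).sum) }
      .toList                                   // List[(name, total)] for each date
  }
```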
For performance reasons, I would like to replace this groupByKey operation with a reduceByKey or an aggregateByKey in order to reduce the amount of data transferred over the network.
However, I can't get my head around how to do this. Both of these transformations take as a parameter a function that combines two values (tuples, in my case), so I can't see how I can group the tuples by name in order to calculate their sums.
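For example, reduceByKey expects a (V, V) => V function, which in my case would mean merging two (name, count) tuples into one, and that has no sensible result when the names differ:

```scala
// reduceByKey would need a ((String, Int), (String, Int)) => (String, Int)
// function here, but there is no meaningful way to combine, say,
// ("name", 1) with ("name2", 1) into a single tuple.
rdd.reduceByKey { case ((name1, count1), (name2, count2)) => ??? }
```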
Is it doable?
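One direction I started exploring (a rough, untested sketch) is to fold the name into the key first, so that the values become plain counts that reduceByKey can add up before the shuffle:

```scala
// Untested idea: re-key by (date, name) so reduceByKey can sum the counts
// map-side, then move the name back into the value afterwards.
val perDateAndName = rdd
  .map { case (date, (name, count)) => ((date, name), count) } // ((date, name), 1)
  .reduceByKey(_ + _)                                          // ((date, name), total)
  .map { case ((date, name), total) => (date, (name, total)) } // (date, (name, total))
```

But then I would still need a final step to collect each date's (name, total) pairs into a list, and the only way I see to do that is yet another groupByKey, so I'm not sure whether this approach actually helps.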