I have an RDD[(String, (String, Double))]. Currently the RDD contains duplicates in the key field. I want to get an RDD[(String, Map[String, Double])] (it does not need to be a vanilla Map, just some kind of fast lookup structure) where the first field has no duplicates, i.e. the (String, Double) values have been collected into a lookup structure for each key.
Currently I have
val result = startingRDD
  .map(x => (x._1, List(x._2)))   // wrap each (String, Double) pair in a one-element list
  .reduceByKey(_ ++ _)            // concatenate the lists for each key
  .map(x => (x._1, x._2.toMap))   // convert each list of pairs to a Map
This does what I want, but I am concerned that it is very slow: list concatenation is O(n), converting to a Map at the end seems like it could be avoided, and there are probably other obvious things I am missing.
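One alternative I have considered is aggregateByKey, which (if I understand it correctly) would let me build the Map per key directly and skip both the intermediate lists and the final toMap pass, but I am not sure it is actually faster. A sketch of what I mean, with the per-key merge logic checked locally on plain Scala collections since I have not benchmarked it on a cluster:

```scala
// seqOp folds one (String, Double) value into a partition-local Map;
// combOp merges the Maps produced by different partitions.
val seqOp: (Map[String, Double], (String, Double)) => Map[String, Double] =
  (acc, v) => acc + v
val combOp: (Map[String, Double], Map[String, Double]) => Map[String, Double] =
  (m1, m2) => m1 ++ m2

// On the RDD this would be (untested, not benchmarked):
// val result = startingRDD.aggregateByKey(Map.empty[String, Double])(seqOp, combOp)

// Checking the merge logic locally on plain collections:
val data = Seq(("k1", ("a", 1.0)), ("k1", ("b", 2.0)), ("k2", ("c", 3.0)))
val local = data.groupBy(_._1).map { case (k, vs) =>
  k -> vs.map(_._2).foldLeft(Map.empty[String, Double])(seqOp)
}
// local("k1") == Map("a" -> 1.0, "b" -> 2.0)
```

Is this the right direction, or does aggregateByKey carry the same cost of building many small immutable Maps?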
How can I implement this logical operation in the most efficient way?
I am also concerned that I cannot find any references to efficient concatenation by key in Spark. Am I just approaching the whole problem incorrectly?