
I have an RDD[(String, (String, Double))]. Currently the RDD contains duplicates in the key field. I want to get an RDD[(String, Map[String, Double])] (does not need to be a vanilla map, just some kind of fast lookup structure) where the first field has no duplicates (i.e. the (String, Double) values have been collected for each key).

Currently I have

    val result = startingRDD
      .map(x => (x._1, List(x._2)))   // one-element list holding the (String, Double) pair
      .reduceByKey(_ ++ _)            // concatenate the lists for each key
      .map(x => (x._1, x._2.toMap))   // build the lookup map per key

This does what I want, but I am concerned that it is very slow (list concatenation is O(n), converting to the map at the end seems like it could be avoided, probably other obvious things I am missing).

How can I implement this logical operation in the most efficient way?

I am also concerned that I cannot find any references to efficient concatenation by key in Spark. Am I just approaching the whole problem incorrectly?

  • Use `aggregateByKey` and build the map directly. Also, `groupByKey` does your `map` and `reduceByKey` in one step: "Group the values for each key in the RDD into a single sequence." – The Archetypal Paul Aug 08 '16 at 06:44
  • Possible duplicate of [Spark performance for Scala vs Python](http://stackoverflow.com/questions/32464122/spark-performance-for-scala-vs-python) – zero323 Aug 08 '16 at 10:51
  • @zero323 I was not sure how to respond to your "possible duplicate" since I don't think that my response will enhance my question. The main reason I believe this question is not a duplicate is that I was unable to find SO questions relating to efficient concatenation in Spark. The question referenced in the "possible duplicate" is much broader and does not turn up in searches such as "efficient spark concatenate". – Nell Aug 09 '16 at 01:18
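A minimal sketch of the `aggregateByKey` approach suggested in the comments, assuming `startingRDD: RDD[(String, (String, Double))]` as described in the question (the rest is illustrative, not a tested implementation):

    import org.apache.spark.rdd.RDD

    // Build the per-key map directly, avoiding the intermediate List and the final toMap.
    val result: RDD[(String, Map[String, Double])] =
      startingRDD.aggregateByKey(Map.empty[String, Double])(
        (acc, pair) => acc + pair,  // within a partition: add one (String, Double) entry
        (m1, m2) => m1 ++ m2        // across partitions: merge the partial maps
      )

If the per-key maps are large, a `scala.collection.mutable.Map` accumulator may be preferable, since it avoids rebuilding an immutable map on every insert; `aggregateByKey` gives each key its own copy of the zero value, so mutating the accumulator is safe.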
