EDIT: I have a collection of vectors and am trying to compute a pairwise relation between each vector and every other vector. I then need to group the results for each vector. The approach I'm trying is as follows (I understand that it computes each pair twice):
Option 1:
val myRDD: RDD[MyType]
val grouped: RDD[(MyType, List[MyType])] = myRDD.cartesian(myRDD)
    .mapValues(List(_))
    .reduceByKey((x, y) => x ::: y) // or groupByKey().mapValues(_.toList)
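For concreteness, here is a minimal, self-contained sketch of Option 1 with a placeholder MyType (an id plus a Vector[Double]) standing in for my real type; the sizes match what I describe below:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions
import org.apache.spark.rdd.RDD

object PairwiseSketch {
  // Placeholder element type: an id plus a vector of doubles. Vector (rather than
  // Array) gives structural equals/hashCode, which matters when the whole object
  // is used as the shuffle key.
  case class MyType(id: Int, vec: Vector[Double])

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pairwise").setMaster("local[*]"))

    // ~500 elements of 100 Doubles each.
    val myRDD: RDD[MyType] =
      sc.parallelize(0 until 500).map(i => MyType(i, Vector.fill(100)(math.random)))

    // Option 1: full cartesian product, then gather every partner under its left element.
    val grouped: RDD[(MyType, List[MyType])] =
      myRDD.cartesian(myRDD)
        .mapValues(List(_))
        .reduceByKey(_ ::: _)

    grouped.take(1).foreach { case (x, ps) => println(s"element ${x.id}: ${ps.size} partners") }
    sc.stop()
  }
}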
Option 2:
val items: Array[MyType] = myRDD.collect()
val grouped: RDD[(MyType, List[(MyType, MyType)])] = myRDD.map(x => (x, items.map(y => (x, y)).toList))
Option 1 seems like the natural choice, but what I'm finding is that even for very small sets, e.g. ~500 elements with each element being, say, a list of one hundred Doubles, the reduceByKey (or groupByKey, which I've also tried) maps to 40000 ShuffleMapTasks that complete at a rate of about 10 per second. After about 30 minutes, when roughly a quarter of them are done, the job fails with a GC out-of-memory error. Is there a way to ensure that the cartesian product preserves partitions? Is there a more efficient way to handle the reduce task? I've also tried different keys (e.g., Ints), but there was no improvement.
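For reference, the Int-key variant I mentioned looked roughly like this (keying on a hypothetical id field rather than the full object; the pairwise relation itself is elided):

// Same pipeline as Option 1, but keyed by a plain Int id instead of the full MyType.
// This changes what gets hashed and compared, but the same ~n^2 records still go
// through the shuffle.
val groupedById: RDD[(Int, List[MyType])] =
  myRDD.cartesian(myRDD)
    .map { case (x, y) => (x.id, List(y)) }
    .reduceByKey(_ ::: _)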
Option 2 is very fast for my particular case because the collection can fit in memory, but of course it seems like a poor choice for larger collections.
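The only way I can see to stretch Option 2 a little is to broadcast the collected array instead of closing over it directly, along these lines, though it is still limited by the whole collection having to fit in memory on every executor:

// Option 2 with an explicit broadcast: the collected array is shipped to each
// executor once, instead of being serialized into every task closure.
val itemsBc = sc.broadcast(myRDD.collect())

val groupedLocal: RDD[(MyType, List[(MyType, MyType)])] =
  myRDD.map(x => (x, itemsBc.value.map(y => (x, y)).toList))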
I've seen a few similar questions, e.g.,
https://groups.google.com/forum/#!topic/spark-users/TZla5TnAMTU
Spark: what's the best strategy for joining a 2-tuple-key RDD with single-key RDD?
I'm sure others have run into this particular problem, and I'd really appreciate any pointers! Thank you.