I am currently writing a program and trying to decide whether to use a groupByKey followed by a join, or simply a join.
Essentially I have one RDD with many values per key and another RDD with only one value per key, but that value is very large. My question is: when I join these two RDDs, would Spark end up making many copies of the large value (one for every instance of the smaller value), or would it keep only one copy of the large value and share it by reference across all of the joined records?
Concretely, I'd have a situation like this:
val invIndexes: RDD[(Int, InvertedIndex)] // InvertedIndex is very large
val partitionedVectors: RDD[(Int, Vector)]
val partitionedTasks: RDD[(Int, (Iterable[Vector], InvertedIndex))] =
  partitionedVectors.groupByKey().join(invIndexes)
val similarities = partitionedTasks.map { case (key, (vectors, invIndex)) =>
  ??? // calculate similarities
}
My question is whether there would actually be any major difference in space complexity between the code above and doing this instead:
val invIndexes: RDD[(Int, InvertedIndex)]
val partitionedVectors: RDD[(Int, Vector)]
val partitionedTasks: RDD[(Int, (Vector, InvertedIndex))] =
  partitionedVectors.join(invIndexes)
val similarities = partitionedTasks.map { case (key, (vector, invIndex)) =>
  ??? // calculate similarities
}
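To make concrete what I mean by "one copy vs. many copies", this is roughly how I would try to check for reference sharing after the plain join. It's only a hypothetical sketch I haven't run: it assumes InvertedIndex is an ordinary reference type, and it materializes each partition in memory just to inspect it.

val sharedWithinPartition = partitionedTasks.mapPartitions { iter =>
  // group this partition's records by key, then check whether every record
  // for a given key points at the exact same InvertedIndex object
  // (reference equality via eq), rather than a separate copy per record
  val byKey = iter.toSeq.groupBy { case (key, _) => key }
  val allShared = byKey.values.forall { records =>
    val first = records.head._2._2
    records.forall { case (_, (_, inv)) => inv eq first }
  }
  Iterator(allShared)
}.reduce(_ && _)

I realize this would only tell me what the objects look like in memory within that stage, not what happens once the result is shuffled or serialized again, which is really the heart of my question.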