I’m looking for a way to compare subsets of an RDD intelligently.
Let’s say I have an RDD of key/value pairs of type (Int, T). I eventually need to say “compare all values of key 1 with all values of key 2, and compare the values of key 3 with the values of key 5 and key 7”. How would I go about doing this efficiently?
The way I’m currently thinking of doing it is by creating a List of filtered RDDs and then using RDD.cartesian():
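import org.apache.spark.rdd.RDD

// Keep only the records whose key equals b.
def filterSubset[T](b: Int, r: RDD[(Int, T)]): RDD[(Int, T)] =
  r.filter { case (name, _) => name == b }

val keyPairs: List[(Int, Int)] = ??? // all key pairs

// r is the source RDD[(Int, T)]; build one cartesian RDD per key pair.
val rddPairs = keyPairs.map { case (a, b) =>
  filterSubset(a, r).cartesian(filterSubset(b, r))
}
rddPairs.map { pairRdd => pairRdd.map { whatever I want to compare… } }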
I would then iterate the list and perform a map on each of the RDDs of pairs to gather the relational data that I need.
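To make the idea concrete, here is a minimal self-contained sketch of what I have in mind. Everything beyond filterSubset and cartesian is an assumption made for illustration: the object name CompareSubsets, the helper compareAll, the local[*] master, the Int values, and the sample data are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object CompareSubsets {
  // Keep only the records whose key equals `key`.
  def filterSubset[T](key: Int, r: RDD[(Int, T)]): RDD[(Int, T)] =
    r.filter { case (k, _) => k == key }

  // Hypothetical helper: for each (a, b) key pair, build the cartesian
  // product of the two filtered subsets and apply `compare` to every
  // (value of a, value of b) combination. Nothing runs until an action
  // is called on the returned RDDs.
  def compareAll[T, R: ClassTag](
      r: RDD[(Int, T)],
      keyPairs: List[(Int, Int)]
  )(compare: (T, T) => R): List[((Int, Int), RDD[R])] =
    keyPairs.map { case (a, b) =>
      val combos = filterSubset(a, r).cartesian(filterSubset(b, r))
      ((a, b), combos.map { case ((_, va), (_, vb)) => compare(va, vb) })
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, purely for trying the sketch out.
    val conf = new SparkConf().setAppName("compare-subsets").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: made-up sample data with Int values.
      val r = sc.parallelize(Seq((1, 10), (1, 11), (2, 20), (3, 30), (5, 50), (7, 70)))
      r.cache() // every filter below scans r, so avoid recomputing it per job

      val keyPairs = List((1, 2), (3, 5), (3, 7))
      val results = compareAll(r, keyPairs)((x, y) => y - x)

      // Each collect() is an action and therefore launches its own Spark job.
      results.foreach { case (pair, rdd) =>
        println(s"$pair -> ${rdd.collect().sorted.mkString(", ")}")
      }
    } finally sc.stop()
  }
}
```

With this layout every key pair produces its own RDD and its own job once an action is called, which is exactly the part I am unsure about.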
What I can’t tell about this idea is whether it would be extremely inefficient to set up possibly hundreds of map jobs and then iterate through them. In this case, would the lazy evaluation in Spark optimize the data shuffling between all of the maps? If not, can someone please recommend a more efficient way to approach this issue?
Thank you for your help.