We have an RDD of N = 10^6 elements. Comparing each element to every other element takes on the order of N^2 = 10^12 operations, so we decided to use Spark to do this on a cluster.
I don't believe we actually need to produce a Cartesian set; there's no reason for a set like {(a,a),(a,b),(b,a),(b,b)} to stick around. As I understand it, Spark would discard it once it is no longer needed, as long as it isn't persisted, but I'd rather not let it exist in the first place :) The Cartesian product obviously takes a lot more memory, and we'd like to avoid that.
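As a rough count, and assuming the comparison is symmetric and an element never needs to be compared with itself (both are assumptions, not something we have verified for every case), the number of comparisons actually required is about half the size of the full Cartesian product:

```latex
\[
\binom{N}{2} = \frac{N(N-1)}{2}
             = \frac{10^{6}\,(10^{6}-1)}{2}
             = 499{,}999{,}500{,}000
             \approx 5 \times 10^{11}
\quad\text{versus}\quad
N^{2} = 10^{12}.
\]
```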
Is there a way in Spark to iterate the way we want without creating the Cartesian product of the RDD with itself?
There must be something; I have been looking at the per-partition family of functions.
Based on the chat session linked below, I am thinking of assigning an "artificial" key to subsets of the RDD elements, divided evenly across the workers' partitions, and then comparing key by key, partition by partition, until everything has been compared.
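To make the idea concrete, here is a minimal sketch in plain Java (no Spark API involved) of the blocking scheme I have in mind. The class name `BlockPairSketch`, the method names, and the `blockSize` parameter are all hypothetical, and the sketch assumes the comparison is symmetric and that self-comparisons are not needed. Each block stands in for the set of elements sharing one artificial key; only two blocks are looked at together at any time, and every unordered pair of elements is visited exactly once.

```java
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical illustration only: plain Java, not Spark code.
final class BlockPairSketch {

    // Returns the elements whose artificial key is blockId.
    static <T> List<T> block(List<T> items, int blockId, int blockSize) {
        int from = blockId * blockSize;
        int to = Math.min(from + blockSize, items.size());
        return items.subList(from, to);
    }

    // Visits every unordered pair of elements exactly once, working on
    // one pair of blocks (a, b) with a <= b at a time.
    // Returns the number of comparisons performed.
    static <T> long forEachUnorderedPair(List<T> items, int blockSize,
                                         BiConsumer<T, T> compare) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        int numBlocks = (items.size() + blockSize - 1) / blockSize;
        long count = 0;
        for (int a = 0; a < numBlocks; a++) {
            List<T> left = block(items, a, blockSize);
            for (int b = a; b < numBlocks; b++) {
                List<T> right = block(items, b, blockSize);
                for (int i = 0; i < left.size(); i++) {
                    // Within the same block, start after i so that
                    // (x, x) and the mirrored pair (y, x) are skipped.
                    int start = (a == b) ? i + 1 : 0;
                    for (int j = start; j < right.size(); j++) {
                        compare.accept(left.get(i), right.get(j));
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

In a Spark job, each (a, b) block pair would presumably become one unit of work on the cluster; how best to express that with the RDD API is exactly what I am asking about.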
NOTES: For what it's worth, we can use a JavaPairRDD and have the DropResult be the index, but the DropResults don't need to be compared in any particular order, as long as each one gets compared to all the others. Thanks.
(NOTE: I don't think using a DataFrame would work because these are custom classes; aren't DataFrames meant for predefined SQL-like data types?
And before anyone suggests it: our target cluster currently runs Spark 1.4.1 and that's out of our control, so if Datasets are useful I'd like to know, but I don't know when I could take advantage of them.)
I have looked at these other questions, including a couple I asked myself, but they don't cover this specific case:
How to compare every element in the RDD with every other element in the RDD ?
Comparing Subsets of an RDD *** interesting chat leads off this question!!! https://chat.stackoverflow.com/rooms/99735/discussion-between-zero323-and-daniel-imberman
These I asked myself, on different subjects, mostly about how to control the creation of RDDs of a desired size:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-meet-nested-loop-on-pairRdd-td21121.html