
We have an RDD of N = 10^6 elements. We know that comparing each element to every other element will take N-squared = 10^12 operations, so we decided to use Spark to accomplish this on a cluster.

I don't believe we actually need to produce a Cartesian product; there's no reason we need a set like {(a,a),(a,b),(b,a),(b,b)} to stick around. If it isn't persisted, Spark will get rid of it once its useful life is done, but I'd rather not let it live in the first place :) The Cartesian product obviously takes a lot more memory and we'd like to avoid that.
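For reference, this is the straightforward Cartesian version we are trying to avoid; it's only a sketch, and compare() stands in for our real comparison logic:

```java
import org.apache.spark.api.java.JavaPairRDD;

// Given a JavaRDD<DropResult> named rdd (built elsewhere), cartesian() pairs the
// RDD with itself, producing every ordered pair: N^2 = 10^12 of them here.
JavaPairRDD<DropResult, DropResult> allPairs = rdd.cartesian(rdd);
allPairs.foreach(pair -> compare(pair._1(), pair._2()));   // compare() is a placeholder
```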

Is there not a way in Spark to iterate the way we want without creating a Cartesian product of the RDD with itself?

There must be something; I have been looking at the partition-level functions.


Based on the chat session linked below, I am thinking of assigning an "artificial" key to subsets of the RDD's elements, dividing them evenly across the workers' partitions, and then comparing by key, partition by partition, until everything has been compared.
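Roughly, this is the kind of thing I have in mind; it is only a sketch, and DropResult, compare(), and the block count are placeholders:

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Given a JavaRDD<DropResult> named rdd, tag each element with an artificial
// block id so only block pairs (not element pairs) need to exist at once.
int numBlocks = 100;   // placeholder; would need tuning to the cluster

JavaPairRDD<Integer, Iterable<DropResult>> blocks = rdd
    .zipWithIndex()                                            // (element, index)
    .mapToPair(t -> new Tuple2<>((int) (t._2() % numBlocks), t._1()))
    .groupByKey();                                             // one group per artificial key

// Pair every block with every other block: ~numBlocks^2 pairs instead of N^2.
blocks.cartesian(blocks)
      .filter(p -> p._1()._1() <= p._2()._1())                 // keep each unordered block pair once
      .foreach(p -> {
          for (DropResult a : p._1()._2()) {
              for (DropResult b : p._2()._2()) {
                  compare(a, b);   // placeholder; also compares a to itself within a block
              }
          }
      });
```

This still uses cartesian, but on the ~100 grouped blocks rather than on the 10^6 individual elements; I have no idea yet how the shuffle behaves here.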


NOTES: For what it's worth, we can use a JavaPairRDD and pair each DropResult with an index, but it's not necessary to compare DropResults to all other DropResults in any particular order, as long as each one gets compared to all the others. Thanks.

(NOTE: I don't think using a DataFrame would work because these are custom classes; aren't DataFrames meant for pre-defined SQL-like datatypes?

And before anyone suggests it, our target cluster is currently running 1.4.1 and that's out of our control, so if Datasets are useful I'd like to know, but I don't know when I could take advantage of them.)

I have looked at these other questions, including a couple I asked, but they don't cover this specific case:

How to compare every element in the RDD with every other element in the RDD?

Spark - Nested RDD Operation

Comparing Subsets of an RDD (an interesting chat leads off this question! https://chat.stackoverflow.com/rooms/99735/discussion-between-zero323-and-daniel-imberman)

These I asked about different subjects, mostly how to control the creation of RDDs to a desired size:

https://stackoverflow.com/questions/34339300/nesting-parallelizations-in-spark-whats-the-right-approach

https://stackoverflow.com/questions/34279781/in-apache-spark-can-i-easily-repeat-nest-a-sparkcontext-parallelize

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-meet-nested-loop-on-pairRdd-td21121.html

  • And I don't believe this question duplicates http://stackoverflow.com/questions/34598464/comparing-subsets-of-an-rdd because we are looking at comparing all elements by all elements, not subsets x subsets. – JimLohse Jan 07 '16 at 22:03
  • I don't understand what's the problem with `cartesian`? It generates the *exact* same pairs as a nested for loop would... – yurib Jan 07 '16 at 22:07
  • Well, the nested for loop reference isn't possible in Spark; it was just meant to show what I am trying to do. And if we were in plain old Java, a nested for loop would not create a totally new data structure containing every possible combination, would it? – JimLohse Jan 07 '16 at 22:12
  • I realize it would do that in a transient fashion, in the sense of comparing every value to every value, but I don't believe that Java (or C++ or C# or...) would keep that whole ugly data structure around for the entire run of a for loop? I really don't know but assume not. – JimLohse Jan 07 '16 at 22:16
  • I think you're wrong in assuming that Spark would hold the entire pairwise product in memory. It is designed specifically for such tasks and can perform them in an efficient manner. – yurib Jan 08 '16 at 19:07
  • I truly appreciate the comment, but am I mistaken in this thinking? Of N=10^6 DropResults in an RDD, where every DropResult needs to be compared to every other DropResult (or int or String or whatever type), Spark has to keep every element of the RDD until the last element has been compared? Sure, on the run of element N, Spark can start deconstructing the RDD, but as long as some worker has not done comparison N yet, the RDD sticks around, whether in memory or serialized? – JimLohse Jan 08 '16 at 20:06
  • So either way, just for pure computer science reasons, let's assume Spark does keep it around? It's looking to me like the approach is to get this thing spread across partitions (by key, unless there's a comparePartitionToPartition function) and then compare every partition to every other partition? I am hoping this structure will reduce or eliminate shuffling. I expect a LOT of network traffic but hope not to shuffle. Thanks to anyone who can post an answer; if not, I will post our solution when we have it. – JimLohse Jan 08 '16 at 20:10
  • JimLohse, did you figure anything out? – rado Aug 17 '16 at 23:38
  • What we finally did, @rado, is to break the 10^6 elements into chunks in Java in the driver, such that we would loop through maybe 100,000 at a time. Also, the nature of the problem changed so that we only needed to compare the 10^6 elements, 100,000 or so at a time, to a reference set of 2000 similar elements. The solution was to use an Accumulable with a custom AccumulableParam and set up the .add() function of the Accumulable to do the comparison (see the sketch after these comments). So we would create an RDD of the 100,000, compare all of those to the 2000 reference elements, throw away the 100,000, build the next 100,000, and repeat. – JimLohse Aug 18 '16 at 00:18
  • I should post a proper answer here. As an overview, I would say we were thinking like imperative Java programmers trying to fit a round Java peg into a square functional-programming Spark hole (sorry if round peg in square hole is too much English idiom; it means it won't work well, if at all). So I would say if my team were more able to be flexible and think in a functional manner, this probably would have gone more smoothly. Thanks for asking! – JimLohse Aug 18 '16 at 00:20
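For anyone who lands here, a rough Java sketch of the Accumulable approach described in the comments above; DropResult, its matches() method, MatchCount, and the chunk size are all placeholders standing in for the real classes and tuning:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.Accumulable;
import org.apache.spark.AccumulableParam;
import org.apache.spark.api.java.JavaSparkContext;

// Placeholder result holder: how many comparisons "matched" the reference set.
class MatchCount implements Serializable {
    long matches = 0;
}

// The ~2000 reference elements ride inside the param; add() does the comparison.
// DropResult (and therefore this param) must be Serializable.
class CompareParam implements AccumulableParam<MatchCount, DropResult> {
    private final List<DropResult> reference;

    CompareParam(List<DropResult> reference) { this.reference = reference; }

    @Override
    public MatchCount addAccumulator(MatchCount acc, DropResult d) {
        for (DropResult ref : reference) {
            if (d.matches(ref)) {          // matches() is a placeholder comparison
                acc.matches++;
            }
        }
        return acc;
    }

    @Override
    public MatchCount addInPlace(MatchCount a, MatchCount b) {
        a.matches += b.matches;
        return a;
    }

    @Override
    public MatchCount zero(MatchCount initial) {
        return new MatchCount();
    }
}

class ChunkedCompare {
    // Driver-side loop: parallelize ~100,000 elements at a time, compare them to
    // the reference set via the accumulable, throw the chunk away, and move on.
    static void compareInChunks(JavaSparkContext jsc, List<DropResult> all, List<DropResult> reference) {
        final int chunkSize = 100_000;
        Accumulable<MatchCount, DropResult> acc =
            jsc.accumulable(new MatchCount(), new CompareParam(reference));
        for (int start = 0; start < all.size(); start += chunkSize) {
            List<DropResult> chunk =
                new ArrayList<>(all.subList(start, Math.min(start + chunkSize, all.size())));
            jsc.parallelize(chunk).foreach(d -> acc.add(d));   // workers compare their elements
        }
        System.out.println("matches: " + acc.value().matches);
    }
}
```

The point, at least as described above, is that the comparison happens inside add() on the workers, so nothing like a full Cartesian product ever has to be materialized.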

0 Answers