I need to process a large collection of items. Every item is processed in the same way and is independent of the other items (`map` operations on an RDD).
Depending on the path taken through the program, different types of information are generated for the items in `map` operations. Subsequent operations can then take advantage of this already-present information to execute more efficiently. This forces a design choice: how do I keep the generated information associated with the items?
My current approach is to have each `map` return tuples that contain both the original information passed in and the newly generated information. I keep accumulating information this way, so that in the end all of it is available in a single RDD.
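To make the tuple-accumulation approach concrete, here is a minimal sketch using plain Python lists to stand in for RDDs (the stages and the derived values are purely illustrative, not from any real pipeline):

```python
# Plain lists stand in for RDDs; each "map" is a list comprehension.
items = [1, 2, 3]

# First map: keep the original item and attach derived info A.
stage1 = [(x, x * 10) for x in items]            # tuples of (item, info_a)

# Second map: carry everything forward and attach derived info B.
stage2 = [(x, a, x + 0.5) for (x, a) in stage1]  # (item, info_a, info_b)

print(stage2)  # [(1, 10, 1.5), (2, 20, 2.5), (3, 30, 3.5)]
```

Each stage widens the tuple, which is why everything ends up bundled in one collection.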
This works, but it would be cleaner to keep the information in separate RDDs. As far as I know, there is no way to keep the information generated in a `map` as a separate RDD that stays associated with the corresponding items that were passed into the `map` (without using IDs). Consequently, there is no way to combine two RDDs, or to run operations over two RDDs, while respecting that association.
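The association problem I mean can be sketched with plain lists standing in for RDDs (hypothetical data): two collections derived from the same source are tied together only by position, and any operation applied to one of them independently silently breaks that tie.

```python
# Two collections associated only by position, not by any shared key.
items = ["a", "bb", "ccc"]
info  = [len(x) for x in items]   # generated per item in a "map"

# Positional association holds as long as nothing disturbs the order.
pairs = list(zip(items, info))    # [("a", 1), ("bb", 2), ("ccc", 3)]

# But filtering (or reordering) one collection on its own breaks it:
filtered = [x for x in items if x != "bb"]
broken = list(zip(filtered, info))  # "ccc" is now paired with "bb"'s info
```

This is why I currently fall back on carrying everything in one collection of tuples.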
Is there a mechanism in Spark that lets you store the information generated from your distributed items in a separate RDD while preserving the association with those items?