I need to process a large collection of items. Every item is processed in the same way and is independent of the other items (`map` operations on an RDD).
Depending on the path taken through the program, different types of information are generated for the items in `map` operations. Subsequent operations can then take advantage of this already-present information to execute more efficiently. This forces a design choice: how do I keep the generated information associated with the items?
My current approach is to have each `map` return tuples that contain both the original information passed in and the newly generated information. I keep accumulating information this way, so that in the end all of it is available in a single RDD.
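To make the tuple-accumulation approach concrete, here is a minimal sketch using plain Python lists to stand in for RDDs (the stages and the derived values are purely illustrative, not from any real pipeline):

```python
# Plain lists stand in for RDDs; each "map" is a list comprehension.
items = [1, 2, 3]

# First map: keep the original item and attach derived info A.
stage1 = [(x, x * 10) for x in items]            # tuples of (item, info_a)

# Second map: carry everything forward and attach derived info B.
stage2 = [(x, a, x + 0.5) for (x, a) in stage1]  # (item, info_a, info_b)

print(stage2)  # [(1, 10, 1.5), (2, 20, 2.5), (3, 30, 3.5)]
```

Each stage widens the tuple, which is why everything ends up bundled in one collection.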
This works, but it would be cleaner to keep the information in separate RDDs. As far as I know, there is no way to keep the information generated in a `map` as a separate RDD that stays associated with the corresponding items that were passed into the `map` (without using IDs). Consequently, there is no way to combine two RDDs, or to run operations over two RDDs, while respecting that association.
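The association problem I mean can be sketched with plain lists standing in for RDDs (hypothetical data): two collections derived from the same source are tied together only by position, and any operation applied to one of them independently silently breaks that tie.

```python
# Two collections associated only by position, not by any shared key.
items = ["a", "bb", "ccc"]
info  = [len(x) for x in items]   # generated per item in a "map"

# Positional association holds as long as nothing disturbs the order.
pairs = list(zip(items, info))    # [("a", 1), ("bb", 2), ("ccc", 3)]

# But filtering (or reordering) one collection on its own breaks it:
filtered = [x for x in items if x != "bb"]
broken = list(zip(filtered, info))  # "ccc" is now paired with "bb"'s info
```

This is why I currently fall back on carrying everything in one collection of tuples.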
Is there a mechanism in Spark that lets you store the information generated from your distributed items in a separate RDD while preserving the association with those items?