I have input data in the following format:
RDD[
(Map1, RecordA),
(Map2, RecordX),
(Map1, RecordB),
(Map2, RecordY),
(Map1, RecordC),
(Map2, RecordZ)
]
The expected output format is a List of RDDs:
List[
RDD[RecordA, RecordB, RecordC],
RDD[RecordX, RecordY, RecordZ]
]
I want each inner RDD to hold the records that share a key (Map1, Map2), and I want an outer List that collects those inner RDDs.
I tried the reduceByKey and aggregateByKey APIs, but have not been successful so far.
Real-world example:
RDD[
(Map("a"->"xyz", "b"->"per"), CustomRecord("test1", 1, "abc")),
(Map("a"->"xyz", "b"->"per"), CustomRecord("test2", 1, "xyz")),
(Map("a"->"xyz", "b"->"lmm"), CustomRecord("test3", 1, "blah")),
(Map("a"->"xyz", "b"->"lmm"), CustomRecord("test4", 1, "blah"))
]
final case class CustomRecord(
string1: String,
int1: Int,
string2: String)
Appreciate your help.
>, iterate over the outer list and use sc.parallelize to convert each inner list to an RDD, then collect them back into a List to get what you want. Or you can take the original RDD, collect the keys, and then iterate over the keys. For each key, map the original RDD to get an RDD of the values for that key, and add those RDDs one by one to a list...
– Lior Chaga Dec 03 '18 at 19:51
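The second approach in the comment above (collect the keys, then build one RDD per key) could look roughly like the sketch below. It is only an illustration under a few assumptions: the keys are `Map[String, String]` as in the real-world example, the number of distinct keys is small enough to collect to the driver, and the names `SplitRddByKey` and `splitByKey` are hypothetical helpers, not part of Spark. The per-key step is done with `filter` followed by `values`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

final case class CustomRecord(string1: String, int1: Int, string2: String)

// Hypothetical helper object, named here only for illustration.
object SplitRddByKey {

  // Builds one RDD of values per distinct key.
  // Assumption: the set of distinct keys is small, because it is collected to the driver.
  def splitByKey(
      input: RDD[(Map[String, String], CustomRecord)]): List[RDD[CustomRecord]] = {
    val keys = input.keys.distinct().collect().toList
    // Each resulting RDD is a lazy filter over the same parent RDD.
    keys.map(k => input.filter { case (key, _) => key == k }.values)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-rdd-by-key")
      .getOrCreate()
    val sc = spark.sparkContext

    val input = sc.parallelize(Seq(
      (Map("a" -> "xyz", "b" -> "per"), CustomRecord("test1", 1, "abc")),
      (Map("a" -> "xyz", "b" -> "per"), CustomRecord("test2", 1, "xyz")),
      (Map("a" -> "xyz", "b" -> "lmm"), CustomRecord("test3", 1, "blah")),
      (Map("a" -> "xyz", "b" -> "lmm"), CustomRecord("test4", 1, "blah"))
    )).cache() // the parent is scanned once per key, so caching avoids recomputation

    val result: List[RDD[CustomRecord]] = splitByKey(input)

    // Print the contents of each per-key RDD; the order of keys is not guaranteed.
    result.foreach(rdd => println(rdd.collect().mkString(", ")))

    spark.stop()
  }
}
```

Note that this scans the parent RDD once per key, so it is only reasonable when there are few distinct keys. reduceByKey and aggregateByKey return a single RDD, which is why they cannot produce a List of RDDs on their own.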