I have two RDDs that share the same keys but have different value types, and rdd1 has more than two elements per record. I want to join these RDDs by key and append their values one after another in the final tuple (see below). What's the best way to do this?
rdd1 = sc.parallelize([ (1, "test1", [5,6,7]), (2, "test2", [1,2,3]) ])
rdd2 = sc.parallelize([ (1, "Foo"), (2, "Bar") ])
Desired output RDD:
[ (1, "Foo", "test1", [5,6,7]), (2, "Bar", "test2", [1,2,3]) ]
Doing a direct join does not work:
print(rdd2.join(rdd1).collect())
#[(1, ('Foo', 'test1')), (2, ('Bar', 'test2'))]
This ignores the rest of the values in rdd1, and the output is in the wrong format.
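For reference, the direction I have been considering is sketched below; I am not sure it is the best way. It assumes the problem is that join only looks at the first two elements of each record, so each record is first reshaped into a (key, values) pair and the joined result is flattened afterwards. The helper names to_pair, flatten and join_flat are my own, not part of the Spark API.

```python
def to_pair(record):
    # Reshape (key, a, b, ...) into (key, (a, b, ...)) so that a
    # pair-based join sees all remaining values as a single value.
    return (record[0], tuple(record[1:]))


def flatten(joined):
    # A join on pair RDDs yields (key, (left_values, right_values));
    # unpack both value tuples into one flat tuple.
    key, (left, right) = joined
    return (key, *left, *right)


def join_flat(left_rdd, right_rdd):
    # Hypothetical helper: expects two RDDs of tuples whose first
    # element is the key. Only map() and join() are used.
    return left_rdd.map(to_pair).join(right_rdd.map(to_pair)).map(flatten)


# Intended usage with the RDDs from the question (needs a SparkContext):
#   join_flat(rdd2, rdd1).collect()
# The order of the collected records is not guaranteed after a join.
```

Putting rdd2 on the left places "Foo"/"Bar" before the rdd1 values, which matches the desired layout above.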