For example, suppose there are two RDDs such as `rdd1 = [[1,2],[3,4]]` and `rdd2 = [[5,6],[7,8]]`. How can I combine them element-wise into this form: `[[1,2,5,6],[3,4,7,8]]`? Is there a function that can solve this problem?
- Possible duplicate of [Spark Dataset API - join](https://stackoverflow.com/questions/36462674/spark-dataset-api-join) – Prasad Khode Oct 27 '17 at 07:09
- Seeing the example, I don't think join works, since I couldn't find a key to join on. – maxmithun Oct 27 '17 at 07:22
2 Answers
You basically need to combine your RDDs using `rdd.zip()` and then perform a `map` operation on the resulting RDD to get your desired output:
rdd1 = sc.parallelize([[1,2],[3,4]])
rdd2 = sc.parallelize([[5,6],[7,8]])

# Zip the two RDDs together (pairs up elements by position)
rdd_temp = rdd1.zip(rdd2)

# Map over the zipped RDD, flattening each pair of lists
# Reference: https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
rdd_final = rdd_temp.map(lambda x: [item for sublist in x for item in sublist])

rdd_final.collect()
# Output: [[1, 2, 5, 6], [3, 4, 7, 8]]
You can also check out the results on the Databricks notebook at this link.
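Note that `rdd.zip()` assumes the two RDDs have the same number of partitions and the same number of elements in each partition, which holds here because both were parallelized from equal-length lists. Since each element is itself a list, a slightly simpler equivalent map is plain list concatenation, sketched below with the same illustrative variable names:

rdd_final = rdd1.zip(rdd2).map(lambda pair: pair[0] + pair[1])
rdd_final.collect()
# Output: [[1, 2, 5, 6], [3, 4, 7, 8]]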

Gambit1614
Another (longer) way to achieve this is with an RDD join:
rdd1 = sc.parallelize([[1,2],[3,4]])
rdd2 = sc.parallelize([[5,6],[7,8]])

# Create keys for the join: zipWithIndex returns (value, index),
# so swap each pair around to get (index, value)
rdd1 = rdd1.zipWithIndex().map(lambda vk: (vk[1], vk[0]))
rdd2 = rdd2.zipWithIndex().map(lambda vk: (vk[1], vk[0]))

# Join on the index key and concatenate the paired lists
rdd_joined = rdd1.join(rdd2).map(lambda kv: kv[1][0] + kv[1][1])
rdd_joined.take(2)
# e.g. [[1, 2, 5, 6], [3, 4, 7, 8]] (join output order is not guaranteed)
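Note that tuple parameter unpacking in lambdas (e.g. `lambda (val, key): ...`) works only in Python 2; it was removed in Python 3 (PEP 3113), hence the indexed access above. Also, `join` does not guarantee output order, so if the original order matters, a minimal sketch is to sort by the generated index key before dropping it:

rdd_ordered = rdd1.join(rdd2).sortByKey().map(lambda kv: kv[1][0] + kv[1][1])
rdd_ordered.collect()
# Output: [[1, 2, 5, 6], [3, 4, 7, 8]]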

ags29