Union with an existing RDD which is a set in pyspark

Asked Sep 29 '18 at 06:27

Active Sep 29 '18 at 06:27

Given a set U, which is stored in an RDD named rdd.

What is the recommended way to merge any given RDD rdd_not_set into rdd such that the resulting RDD is also a set?

rdd = sc.union([rdd, rdd_not_set])
rdd = rdd.reduceByKey(reduce_func)

Example: given rdd = sc.parallelize([(1,2), (2,3)]) and rdd_not_set = sc.parallelize([(1,4), (3,4)]), the result should be final_rdd = sc.parallelize([(1,4), (2,3), (3,4)]).

The naive solution is to perform a union followed by reduceByKey, which would be very inefficient because rdd is huge.
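To make the naive approach concrete, here is a minimal sketch of what reduce_func could look like. It assumes, based only on the example above, that a value from rdd_not_set should overwrite the value already stored in rdd for the same key. Because reduceByKey expects a commutative and associative function and gives no guarantee about the order in which values are combined, each value is first tagged with the RDD it came from. The names tag_old, tag_new, prefer_higher_priority and merge_as_set are made up for illustration.

```python
def tag_old(value):
    """Mark a value as coming from the existing set RDD (low priority)."""
    return (0, value)


def tag_new(value):
    """Mark a value as coming from the incoming RDD (high priority)."""
    return (1, value)


def prefer_higher_priority(a, b):
    """Reduce function: keep whichever tagged value has the higher priority.

    If both priorities are equal (duplicate keys inside one RDD), the choice
    is arbitrary; this is an assumption, not a requirement from the question.
    """
    return a if a[0] >= b[0] else b


def merge_as_set(rdd, rdd_not_set, num_partitions=None):
    """Union two pair RDDs and keep exactly one value per key.

    Assumption: values from rdd_not_set win over values from rdd.
    """
    tagged_old = rdd.mapValues(tag_old)
    tagged_new = rdd_not_set.mapValues(tag_new)
    return (
        tagged_old.union(tagged_new)
        .reduceByKey(prefer_higher_priority, numPartitions=num_partitions)
        .mapValues(lambda tagged: tagged[1])  # drop the priority tag again
    )
```

With a live SparkContext sc, calling merge_as_set(rdd, rdd_not_set).collect() on the two example RDDs should yield the pairs (1,4), (2,3) and (3,4), in no particular order. This still shuffles the full union, so it only pins down the semantics of the naive version and does not address the cost concern.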

asked Sep 29 '18 at 06:27

letsBeePolite

By `set`, you mean that the first element of the `tuple` is the primary key? So if there's a duplicated key in `rdd_not_set`, you want to update with the new value? – pault Oct 01 '18 at 14:16
Related (for DataFrames): [How to update a pyspark dataframe with new values from another dataframe?](https://stackoverflow.com/questions/50295783/how-to-update-a-pyspark-dataframe-with-new-values-from-another-dataframe) – pault Oct 01 '18 at 14:20

0 Answers