I have a set U stored in an RDD named rdd.

What is the recommended way to merge rdd with an arbitrary RDD rdd_not_set so that the resulting RDD is also a set?

rdd = sc.union([rdd, rdd_not_set])
rdd = rdd.reduceByKey(reduce_func)

Example: with rdd = sc.parallelize([(1,2), (2,3)]) and rdd_not_set = sc.parallelize([(1,4), (3,4)]), the result should be final_rdd = sc.parallelize([(1,4), (2,3), (3,4)]).

The naive solution is the union followed by reduceByKey shown above, but that would be very inefficient because rdd is huge.
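To make the naive approach concrete, here is a minimal sketch of what reduce_func could look like. It assumes the value from rdd_not_set should win on a key collision, as the example suggests. Because reduceByKey does not guarantee the order in which values are combined, the sketch tags each value with its source first. The names OLD, NEW, merge_as_set and merge_locally are hypothetical and only for illustration. It does not address the efficiency concern.

```python
# Hypothetical priority tags: a higher number wins on a key collision.
OLD, NEW = 0, 1


def reduce_func(a, b):
    """Combine two (priority, value) pairs, keeping the higher priority.

    On a tie (e.g. duplicate keys inside rdd_not_set) the choice is arbitrary.
    """
    return a if a[0] >= b[0] else b


def merge_as_set(sc, rdd, rdd_not_set):
    """Naive merge: tag, union, reduceByKey, then drop the tags."""
    tagged_old = rdd.mapValues(lambda v: (OLD, v))
    tagged_new = rdd_not_set.mapValues(lambda v: (NEW, v))
    merged = sc.union([tagged_old, tagged_new]).reduceByKey(reduce_func)
    return merged.mapValues(lambda tagged: tagged[1])


def merge_locally(old_pairs, new_pairs):
    """Plain-Python stand-in for the same logic, runnable without Spark."""
    tagged = [(k, (OLD, v)) for k, v in old_pairs]
    tagged += [(k, (NEW, v)) for k, v in new_pairs]
    result = {}
    for key, tagged_value in tagged:
        if key in result:
            result[key] = reduce_func(result[key], tagged_value)
        else:
            result[key] = tagged_value
    return sorted((k, tv[1]) for k, tv in result.items())
```

With a live SparkContext, merge_as_set(sc, rdd, rdd_not_set).collect() would be the call to try; merge_locally only mirrors the collision rule on plain lists.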

letsBeePolite
  • By `set`, you mean that the first element of the `tuple` is the primary key? So if there's a duplicated key in `rdd_not_set`, you want to update with the new value? – pault Oct 01 '18 at 14:16
  • Related (for DataFrames): [How to update a pyspark dataframe with new values from another dataframe?](https://stackoverflow.com/questions/50295783/how-to-update-a-pyspark-dataframe-with-new-values-from-another-dataframe) – pault Oct 01 '18 at 14:20