Given a set U, which is stored in an RDD named rdd, what is the recommended way to merge any given RDD rdd_not_set with rdd such that the resultant rdd is also a set?
rdd = sc.union([rdd, rdd_not_set])
rdd = rdd.reduceByKey(reduce_func)
Example: rdd = sc.parallelize([(1, 2), (2, 3)]) and rdd_not_set = sc.parallelize([(1, 4), (3, 4)]) should give final_rdd = sc.parallelize([(1, 4), (2, 3), (3, 4)]).
The naive solution is to perform a union and then a reduceByKey, which would be very inefficient because rdd will be huge.
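To make the intended semantics concrete, here is a minimal plain-Python model of the union followed by reduceByKey. It is only a sketch and does not use Spark: the lists stand in for the RDDs, and reduce_by_key and prefer_newer are hypothetical helpers, not PySpark functions. It assumes that for a key present in both inputs the value from rdd_not_set should win, as in the example for key 1. Because reduceByKey expects a commutative and associative function, each value is tagged with its source so the result does not depend on merge order.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for RDD.reduceByKey: fold the values of each key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())


def prefer_newer(a, b):
    """Hypothetical reduce_func: keep the pair with the higher source tag.

    max() on (tag, value) tuples is commutative and associative; if two
    pairs share a tag, the comparison falls back to the value itself.
    """
    return max(a, b)


# Lists standing in for the RDDs from the question.
rdd = [(1, 2), (2, 3)]
rdd_not_set = [(1, 4), (3, 4)]

# Tag values with their source: 0 for the existing set, 1 for the new data.
# In PySpark this step would roughly be a mapValues on each RDD before
# sc.union([...]).reduceByKey(prefer_newer), followed by dropping the tag.
tagged = [(k, (0, v)) for k, v in rdd] + [(k, (1, v)) for k, v in rdd_not_set]

final_rdd = [(k, v) for k, (_tag, v) in reduce_by_key(tagged, prefer_newer)]
print(final_rdd)
```

This reproduces the final_rdd from the example, but it still touches every element of rdd, so it models the naive approach rather than a cheaper one.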