As I understand it, distinct() hash-partitions the RDD to identify the unique keys. But does it optimize by moving only the distinct tuples per partition?
Imagine an RDD with the following partitions:
- [1, 2, 2, 1, 4, 2, 2]
- [1, 3, 3, 5, 4, 5, 5, 5]
On calling distinct() on this RDD, would all the duplicate keys (the 2s in partition 1 and the 5s in partition 2) get shuffled to their target partitions, or would only the distinct keys per partition get shuffled?
If all keys get shuffled, then an aggregate() that builds a set() per partition would reduce the shuffle:
    def set_update(u, v):
        u.add(v)
        return u

    rdd.aggregate(set(), set_update, lambda u1, u2: u1 | u2)
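For reference, the same seqOp/combOp logic can be sketched without Spark, folding each of the example partitions above into its own set and then unioning the results (a minimal simulation, not Spark's actual execution):

```python
from functools import reduce

def set_update(u, v):
    u.add(v)
    return u

# The two example partitions from the question.
partitions = [[1, 2, 2, 1, 4, 2, 2], [1, 3, 3, 5, 4, 5, 5, 5]]

# seqOp: each partition folds into its own set, so duplicates
# are collapsed locally before any merge across partitions.
per_partition = [reduce(set_update, part, set()) for part in partitions]

# combOp: union the per-partition sets into the final result.
result = reduce(lambda u1, u2: u1 | u2, per_partition)
print(sorted(result))  # [1, 2, 3, 4, 5]
```

Note that, unlike distinct(), aggregate() here returns a single set to the driver rather than a distributed RDD, so only the per-partition sets cross the network.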