I have a requirement where I have a huge dataset of over 2 trillion records, produced by a join. After the join, I need to aggregate on a column (the 'id' column) and get the list of distinct names for each id (collect_set('name')).
Now, while saving the join result in step 1, will I get any benefit from repartitioning it on the 'id' field, i.e. joined_df.repartition('id').write.parquet(path)?
If I then read that repartitioned DataFrame back, will Spark understand that it is already partitioned on the 'id' field, so that the group by on 'id' sees a big performance improvement?