Spark - How to re-partition dataframe on basis of columns with minimum shuffle?

Question

We have multiple Dataframes.

One of the dataframes is the primary one, which is joined with the other dataframes using left-outer joins. All of these dataframes are joined on 4 columns (say col1, col2, col3, col4).

To reduce data shuffle, we currently re-partition all the dataframes on the 4 join columns and then join them (left-outer).

Is there a better way to join/repartition these dataframes so that the data shuffle is minimized?

Thanks

score 0 · Answer 1 · answered May 14 '19 at 07:58

Repartitioning will not avoid the shuffle (the repartition is itself a shuffle); it only optimizes the joins. If both of your dataframes are big and not small enough to fit into memory for a broadcast hash join, you can save them as bucketed tables and then perform a sort-merge join. This way you can skip the shuffle that usually precedes the sort phase when joining two big dataframes. See the linked question "Spark join *without* shuffle". This technique is useful only when you have to join the same dataframes multiple times, as bucketing these tables also adds some overhead to your Spark application.

score 0 · Answer 2 · answered Nov 03 '19 at 12:45

Late reply to my post. We ended up using broadcast.

We removed the re-partitioning from both dataframes and broadcast the smaller dataframe.

Anuj Mehra
