2

We have multiple Dataframes.

One of the dataframe is the primary one, which is joined with the other dataframes using left-outer joins. All these dataframes are joined on 4 columns (say col1,col2,col3,col4).

To reduce data shuffle, currently we are re-partitioning all the dataframes on the 4 join columns, and then joining these dataframes (left-outer).

Is there a better way to join/repartition these dataframes, so that the data shuffle is minimum?

Thanks

Anuj Mehra
  • 320
  • 3
  • 19

2 Answers2

0

Repartition will not avoid the shuffle it will optimize the joins. If your both dataframes are big and are not small enough to fit into memory for broadcast hash joins.. you can save your dataframe as bucketed tables and can then perform sort merge join. This way you can skip the sort phase shuffle which usually takes place before joining the two big dataframes.. see link below Spark join *without* shuffle This technique is useful only when you have to join same dataframes multiple times.. as bucketing these table will also cause some overhead for you spark application.

vikrant rana
  • 4,509
  • 6
  • 32
  • 72
0

Late reply to my post. We ended up using broadcast.

We removed re-partition from both the dataframes, and broadcast the smaller dataframe.

Anuj Mehra
  • 320
  • 3
  • 19