I have a large Spark DataFrame, around 25 GB in size, that I have to join with another DataFrame of about 15 GB.
When I run the code, it takes around 15 minutes to complete.
The resource allocation is 40 executors with 128 GB of memory each.
When I went through the execution plan, I saw that a sort merge join was being performed.
The problem is:
The join is performed 5 to 6 times on the same key but against different tables, so most of the time goes into sorting the data and co-locating the partitions before the actual merge/join, and this happens again for every join.
So, is there any way to sort the data once before performing the joins, so that the sort operation is not repeated for each join, or at least to optimize it so that less time is spent sorting and more time actually joining the data?
In short, I want to sort my DataFrame on the join key before performing the joins, but I'm not sure how to do it.
For example:
If my DataFrame is joined on the id column:
joined_df = df1.join(df2, df1.id == df2.id)
How can I sort the DataFrames on 'id' before joining so that the partitions are co-located?