I have a large Spark DataFrame, around 25 GB in size, that I have to join with another DataFrame of about 15 GB.
When I run the code, it takes around 15 minutes to complete.
The resource allocation is 40 executors with 128 GB of memory each.
When I went through the execution plan, I saw that a sort merge join was being performed.
The problem is:
The join is performed 5 to 6 times on the same key but against different tables, so most of the time goes into sorting the data and co-locating the partitions before the actual merge/join, and this happens again for every join.
So, is there any way to sort the data once before performing the joins, so that the sort operation is not repeated for each join, or at least to optimize it so that less time is spent sorting and more time actually joining the data?
In short, I want to sort my DataFrame on the join key before performing the joins, but I'm not sure how to do it.
For example:
If my DataFrame is joined on the id column:
joined_df = df1.join(df2, df1.id == df2.id)
How can I sort the DataFrames on 'id' before joining so that the partitions are co-located?