I have two large dataframes `df1` and `df2`, both partitioned by column `a`, and I want to efficiently compute a left join on both `a` and another column `b`:
```python
df1.join(df2, on=['a', 'b'], how='left_outer')
```
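For concreteness, here is a minimal sketch of the setup (the toy data and the `repartition` calls are illustrative stand-ins; the real dataframes come out of earlier pipeline stages already partitioned by `a`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-ins for the real (much larger) dataframes,
# each hash-partitioned by column 'a'.
df1 = spark.createDataFrame(
    [(1, 10, 'x'), (1, 11, 'y'), (2, 10, 'z')], ['a', 'b', 'c']
).repartition('a')
df2 = spark.createDataFrame(
    [(1, 10, 'p'), (2, 12, 'q')], ['a', 'b', 'd']
).repartition('a')
```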
When written as above, Spark reshuffles both dataframes by the composite key `(a, b)`, which is extremely inefficient. Instead, I would like it to take advantage of the existing partitioning by `a` to avoid the shuffle and perform the join within each partition. That should be a lot faster, especially since I have further processing steps that benefit from the same partitioning.
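This is easy to see with `explain()`. Continuing the sketch above (and disabling broadcast joins so the toy example behaves like the real, too-large-to-broadcast dataframes):

```python
# Force a shuffle-based join, as would happen with the real data.
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', -1)

joined = df1.join(df2, on=['a', 'b'], how='left_outer')

# The physical plan contains 'Exchange hashpartitioning(a, b, ...)'
# on both inputs: a full reshuffle on the composite key rather than
# a reuse of the existing partitioning by 'a'.
joined.explain()
```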
Is there any way to prevent this shuffle and obtain a resulting dataframe that is still partitioned by `a`?
Note that if it were an inner join I could do the following, but (1) I'm not sure it would be efficient, and anyway (2) it doesn't work with a left join, since filtering on `df1.b == df2.b` would discard the unmatched rows (where `df2.b` is null) that a left join must keep (I'm only including it in case it helps someone else):

```python
df1.join(df2, on=['a'], how='inner').filter(df1.b == df2.b)
```
PS: both dataframes are too large to be broadcast