I am wondering whether we can force Spark to use a custom partitioning key during a join operation between two DataFrames.
For example, let's consider
df1: DataFrame - [groupid, other_column_a]
df2: DataFrame - [groupid, other_column_b]
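For concreteness, a minimal toy setup might look like this (PySpark; the session and the literal values are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two small example frames with the schemas described above
df1 = spark.createDataFrame([(1, "a1"), (1, "a2"), (2, "a3")], ["groupid", "other_column_a"])
df2 = spark.createDataFrame([(1, "b1"), (2, "b2")], ["groupid", "other_column_b"])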
If I run
df_join = df1.join(df2, "groupid")
Spark will use "groupid" as the partition key and perform the join on each partition. The problem is that this can run out of memory on a machine if a partition is too big.
However, it seems theoretically possible to perform the join with, say, (groupid, other_column_a)
as the partitioning key (to reduce the size of each partition).
Is it possible to do this with Spark? I tried
df1.repartition("group_id","other_column_a")
upfront, but this is overridden by the join (I checked it with df_join.explain()). I can't find any resource online that explains how to do this.
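For reference, here is roughly the full sequence I tried (the repartition call is the only change from the plain join):

# Repartition by (groupid, other_column_a) before the join,
# hoping Spark keeps this partitioning during the join
df1_repart = df1.repartition("groupid", "other_column_a")
df_join = df1_repart.join(df2, "groupid")

# The physical plan still shows an exchange on groupid alone,
# i.e. my repartitioning is overridden by the join
df_join.explain()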
Thanks!