I have a DataFrame df with about 2 million rows. I have already repartitioned it on a key called ID, since the data is ID based:

df = df.repartition(num_of_partitions, 'ID')
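As a sanity check that the layout is what I think it is, I can read the partition count back (a minimal sketch with hypothetical toy rows standing in for my real data; num_of_partitions is just a placeholder for the value I actually use):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for my real data, same columns
df = spark.createDataFrame(
    [(1, 'H01'), (2, 'H02'), (3, 'H01')],
    ['ID', 'hospital_code'],
)

num_of_partitions = 8  # placeholder for the value I actually use
df = df.repartition(num_of_partitions, 'ID')
print(df.rdd.getNumPartitions())  # prints 8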
Now I want to join this df to a relatively small DataFrame df2 on a common column, hospital_code, but without losing the ID-based partitioning of df:

df.join(df2, 'hospital_code', 'left')
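As I understand it, a plain equi-join like this normally plans a shuffle (SortMergeJoin), which redistributes both sides on hospital_code and so throws away my ID-based layout. The physical plan should make that visible (a sketch reusing spark and df from above, with a hypothetical df2 as the small lookup table):

# Hypothetical small lookup table with the common column
df2 = spark.createDataFrame(
    [('H01', 'General'), ('H02', 'Clinic')],
    ['hospital_code', 'hospital_type'],
)

plain = df.join(df2, 'hospital_code', 'left')
# An Exchange hashpartitioning(hospital_code, ...) above df's scan means
# the big side was re-shuffled on hospital_code. (With tiny test data,
# Spark may auto-broadcast df2 anyway because of
# spark.sql.autoBroadcastJoinThreshold, so the plan could already show
# a BroadcastHashJoin here.)
plain.explain()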
I have read that when one DataFrame is much larger than the other, it is a good idea to use a broadcast join, as shown below, and that this maintains the partitioning of the larger DataFrame df. But I am not certain of it.

from pyspark.sql.functions import broadcast
df.join(broadcast(df2), 'hospital_code', 'left')
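Rather than guessing, one check I can think of is to look at the physical plan of the broadcast variant (again reusing df and df2 from the sketches above):

from pyspark.sql.functions import broadcast

bcast = df.join(broadcast(df2), 'hospital_code', 'left')
# What I expect to see if the claim is true: a BroadcastHashJoin with no
# Exchange above df, i.e. the big side keeps its ID-based distribution
bcast.explain()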
Can anyone suggest an efficient way to approach this, i.e. how to maintain the partitioning of the larger DataFrame without compromising much on latency, extra shuffles, and so on?