I have a DataFrame df with about 2 million rows. I have already repartitioned it on a key called ID, since the data is ID based:

df = df.repartition(num_of_partitions, 'ID')
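As a sanity check that the layout is what I think it is, I can read the partition count back (a minimal sketch with hypothetical toy rows standing in for my real data; num_of_partitions is just a placeholder for the value I actually use):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for my real data, same columns
df = spark.createDataFrame(
    [(1, 'H01'), (2, 'H02'), (3, 'H01')],
    ['ID', 'hospital_code'],
)

num_of_partitions = 8  # placeholder for the value I actually use
df = df.repartition(num_of_partitions, 'ID')
print(df.rdd.getNumPartitions())  # prints 8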
Now I want to join this df to a relatively small DataFrame df2 on a common column, hospital_code, but without losing the ID-based partitioning of df:

df.join(df2, 'hospital_code', 'left')
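As I understand it, a plain equi-join like this normally plans a shuffle (SortMergeJoin), which redistributes both sides on hospital_code and so throws away my ID-based layout. The physical plan should make that visible (a sketch reusing spark and df from above, with a hypothetical df2 as the small lookup table):

# Hypothetical small lookup table with the common column
df2 = spark.createDataFrame(
    [('H01', 'General'), ('H02', 'Clinic')],
    ['hospital_code', 'hospital_type'],
)

plain = df.join(df2, 'hospital_code', 'left')
# An Exchange hashpartitioning(hospital_code, ...) above df's scan means
# the big side was re-shuffled on hospital_code. (With tiny test data,
# Spark may auto-broadcast df2 anyway because of
# spark.sql.autoBroadcastJoinThreshold, so the plan could already show
# a BroadcastHashJoin here.)
plain.explain()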
I have read that when one DataFrame is much larger than the other, it is a good idea to use a broadcast join, as shown below, and that this maintains the partitioning of the larger DataFrame df. But I am not certain of it.

from pyspark.sql.functions import broadcast
df.join(broadcast(df2), 'hospital_code', 'left')
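Rather than guessing, one check I can think of is to look at the physical plan of the broadcast variant (again reusing df and df2 from the sketches above):

from pyspark.sql.functions import broadcast

bcast = df.join(broadcast(df2), 'hospital_code', 'left')
# What I expect to see if the claim is true: a BroadcastHashJoin with no
# Exchange above df, i.e. the big side keeps its ID-based distribution
bcast.explain()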
Can anyone suggest an efficient way to approach this, i.e. how to maintain the partitioning of the larger DataFrame without compromising much on latency, extra shuffles, and so on?