
I have a Spark SQL query that goes like this:

SELECT /*+ BROADCASTJOIN (sbg_published.sk_e2e_web_all_vis) */
       a.* 
FROM 
       sbg_published.sk_e2e_web_all_vis a
LEFT JOIN 
       sbg_published.web_funnel_detail_v4 b
       ON a.col1 = b.col1

I am running this query using spark.sql(). The first table has around 1 million records and the second has around 1.5 billion records.

I am trying to force Spark to use a broadcast join, but instead it adopts a sort-merge join.
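One way to confirm which strategy Spark actually picked is to inspect the physical plan rather than the DAG. Below is a minimal sketch, assuming a SparkSession named spark; note that the hint references the alias a, since Spark's join hints take the relation name or alias:

# Sketch: run the query and inspect the physical plan.
# Look for "BroadcastHashJoin" vs. "SortMergeJoin" in the output.
df = spark.sql("""
    SELECT /*+ BROADCAST(a) */
           a.*
    FROM sbg_published.sk_e2e_web_all_vis a
    LEFT JOIN sbg_published.web_funnel_detail_v4 b
         ON a.col1 = b.col1
""")
df.explain(mode="formatted")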

Following are the Spark parameters I have used; a sketch of how they are applied follows the list.

"spark.sql.autoBroadcastJoinThreshold" = "4048576000"
"spark.sql.broadcastTimeout" = "100000"
"spark.sql.shuffle.partitions" = 500
"spark.sql.adaptive.enabled" = "true"
"spark.sql.adaptive.coalescePartitions.enabled" = "true"
"spark.sql.adaptive.autoBroadcastJoinThreshold" ="4048576000"
"spark.sql.join.preferSortMergeJoin" = "false"
"spark.shuffle.io.maxRetries"="10"
"spark.dynamicAllocation.enabled"="true"
"spark.shuffle.service.enabled"="true"
"spark.shuffle.compress"="true"
"spark.shuffle.spill.compress"="true"
"spark.driver.maxResultSize"="0"

This is the DAG: [DAG screenshot]

I then also tried this parameter:

"spark.sql.join.preferSortMergeJoin" = "false"

This made the sort-merge join go away; Spark adopted a shuffle hash join instead.

I am using Spark 3.2.
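Spark 3.x also supports per-query join strategy hints (BROADCAST, MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL), so a shuffle hash join can be requested for just this query instead of disabling sort-merge joins session-wide. A minimal sketch, again assuming a SparkSession named spark:

# Sketch: request a shuffle hash join for this query only.
df = spark.sql("""
    SELECT /*+ SHUFFLE_HASH(a) */
           a.*
    FROM sbg_published.sk_e2e_web_all_vis a
    LEFT JOIN sbg_published.web_funnel_detail_v4 b
         ON a.col1 = b.col1
""")
df.explain()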

Thanks in advance!

Sankar

1 Answer


Besides the "spark.sql.autoBroadcastJoinThreshold" setting, Spark has a hard broadcast size limit of 8GB. You can't force Spark to broadcast a table once it exceeds 8GB. So you can try to resolve it by:

  1. Rewriting the SQL to broadcast the small table (a sketch follows this list).
  2. Rewriting the SQL as a union of smaller joins.
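A minimal sketch of option 1 with the DataFrame API's broadcast() function (table and column names taken from the question). One caveat worth verifying: a broadcast hash join builds its hash table from the broadcast side, and an outer join cannot broadcast its preserved side, so the question's LEFT JOIN (small table on the left) may need restructuring before any broadcast hint can be honored:

from pyspark.sql.functions import broadcast

# Sketch of option 1: mark the ~1M-row table for broadcasting explicitly.
small = spark.table("sbg_published.sk_e2e_web_all_vis")
large = spark.table("sbg_published.web_funnel_detail_v4")

# broadcast() overrides Spark's size estimate, but the hard 8GB
# broadcast limit described above still applies.
# Inner join shown for illustration; the original LEFT JOIN preserves
# the small table's rows, which rules out broadcasting that side.
joined = large.join(broadcast(small), on="col1")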
Gary Li