How do I enable broadcast join in spark left outer join?

Asked Jul 04 '22 at 14:46

Active Jul 04 '22 at 14:46

Viewed 248 times

Below is an analogic schema

Table 1 - Players (Upto 1M players per team, HUGE)
player_number - INT
team_id - bigint (indexed)

Table 2 - Team (Relatively small table which I want to broadcast)
team_id - bigInt (indexed)
team_size - INT

Below is my query which is using sort-merge join which is becoming slow

select id as teamId , teamSize from teams
  left outer join (select teamId, count(1) as size from players group by teamId) as players 
    on (teams.id= players.teamId) where players.size != teams.teamSize;

The table players can be quite huge. There can be as many as 1M players in a team. The table players are partitioned by team_id so the above query can run by itself on a single partition.

How do I broadcast teams table which is relatively smaller to use broadcast join instead of sort-merge join.

Attaching query plan

asked Jul 04 '22 at 14:46

sharin gan

This is the same question I've asked over a year ago. – ZygD Jul 04 '22 at 14:51
Does this answer your question? [Broadcast left table in a join](https://stackoverflow.com/questions/68123190/broadcast-left-table-in-a-join) – ZygD Jul 04 '22 at 14:51
Any solution/optimization? – sharin gan Jul 04 '22 at 14:52
You see all the info in that question. Bottom line - you can't, but there are workarounds. Read the question and answers. They are decent. – ZygD Jul 04 '22 at 14:54
How much performance improvement did you see with using this method? @ZygD – sharin gan Jul 05 '22 at 04:45
see [this](https://stackoverflow.com/q/68296559/8279585) Q's answers for broadcast joins in sql and pyspark api – samkart Jul 05 '22 at 06:21
why can't we change the order of the join -- players join teams -- and broadcast teams. you select whatever is required from the two tables – samkart Jul 05 '22 at 06:36

How do I enable broadcast join in spark left outer join?

0 Answers0