As far as I know, when Spark performs a broadcast join it first collects the smaller (broadcast) RDD to the driver to build a broadcast variable from it, and only then sends it to each target node.
This sometimes leads to driver out-of-memory errors when the RDD being broadcast is larger than spark.driver.memory.
The question: why does it work this way? It seems more efficient to just shuffle the broadcast data between the target nodes, because the amount of data to transfer is the same but we avoid overloading the driver.
Example: say you have 3 nodes, each holding 1 GB of the data to broadcast, and each node has 1 GB/s of throughput.
Spark approach: each node has to upload its piece of data (1 GB) to the driver and download the broadcast variable (3 * 1 GB = 3 GB), so each node transfers 4 GB in total, which takes 4 s.
Shuffle approach: each node has to upload 1 GB to each of the 2 other nodes and download 1 GB from each of them. Again, the total is 4 GB and it takes 4 s.
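
To make the arithmetic in my example easy to check, here is a minimal sketch in plain Python (no Spark involved). The function names are hypothetical, and it assumes a very simplified model: every node holds an equal piece, and upload and download share the same per-node throughput.

```python
def driver_broadcast_gb(nodes: int, chunk_gb: float) -> float:
    """Per-node traffic when pieces are collected on the driver, then broadcast."""
    upload = chunk_gb              # send the local piece to the driver
    download = nodes * chunk_gb    # receive the full broadcast variable
    return upload + download


def shuffle_gb(nodes: int, chunk_gb: float) -> float:
    """Per-node traffic when nodes exchange their pieces directly."""
    peers = nodes - 1
    upload = peers * chunk_gb      # send the local piece to every other node
    download = peers * chunk_gb    # receive one piece from every other node
    return upload + download


def transfer_seconds(total_gb: float, throughput_gb_per_s: float) -> float:
    # Assumption: upload and download share one throughput budget per node.
    return total_gb / throughput_gb_per_s


NODES, CHUNK_GB, THROUGHPUT = 3, 1.0, 1.0

for name, total in [
    ("driver broadcast", driver_broadcast_gb(NODES, CHUNK_GB)),
    ("shuffle", shuffle_gb(NODES, CHUNK_GB)),
]:
    print(f"{name}: {total} GB per node, {transfer_seconds(total, THROUGHPUT)} s")
```

With 3 nodes both functions return 4 GB per node. In this simplified model the two totals need not match for other node counts, so the numbers above only cover the 3-node case from my example.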