Spark BroadcastJoin hint not broadcasting as expected (Spark 2.3)
I have two DataFrames, a and b. DataFrame a is fairly small: 50,000 rows, about 940 KB. DataFrame b is large, with 12 TB of data. I am joining them like this:
broadcast(a).join(b, $"a.id" === $"b.id", "left")
but the query plan says SortMergeJoin [id#163], [id#187], LeftOuter
I was hoping to see BroadcastHashJoin.
I tried other options and observed that:
b.join(broadcast(a), $"a.id" === $"b.id", "left")
gives me BroadcastHashJoin [id#163], [id#187], LeftOuter, BuildRight
This was just an experiment. I can't use this because I need "a left join b".
I also tried b.join(broadcast(a), $"a.id" === $"b.id", "right")
but this again gives me SortMergeJoin.
Memory is not an issue. I don't see any spills, and the driver has 16 GB as well.
Any idea why Spark might not broadcast despite the small dataset and an explicit hint?
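For reference, here is a minimal sketch that should reproduce the three variants on a local session. It is not my real job: the object name, the column names va and vb, and the tiny stand-in data are made up for illustration, and I am assuming that disabling spark.sql.autoBroadcastJoinThreshold is a fair way to make sure only the explicit hint decides the join strategy.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical object name; needs Spark SQL on the classpath.
object BroadcastHintRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-repro")
      .getOrCreate()
    import spark.implicits._

    // Assumption: turn off size-based automatic broadcasting so that
    // only the explicit broadcast() hint can trigger a broadcast join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Tiny stand-ins for the real data; the aliases make $"a.id" and $"b.id" resolvable.
    val a = Seq((1, "x"), (2, "y")).toDF("id", "va").as("a")
    val b = (1 to 1000).map(i => (i, i * 2)).toDF("id", "vb").as("b")

    // The join I actually need: a left join b, with the hint on a.
    broadcast(a).join(b, $"a.id" === $"b.id", "left").explain()

    // Experiment 1: hint on the right-hand side of a left join.
    b.join(broadcast(a), $"a.id" === $"b.id", "left").explain()

    // Experiment 2: "a left join b" written with the sides swapped, as a right join.
    b.join(broadcast(a), $"a.id" === $"b.id", "right").explain()

    spark.stop()
  }
}
```

Each explain() call prints the physical plan, so the join operator chosen for each variant can be compared side by side.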