2

Spark BroadcastJoin hint not broadcasting as expected (Spark 2.3)

I have 2 Dataframes, let's say a & b. Dataframe a is fairly small with 50000 rows, about 940 K size. Dataframe b is large with 12TB of data. I am joining like:

Broadcast(a).join(b, $"a.id" === $"b.id", "left")

but query plan says SortMergeJoin [id#163], [id#187], LeftOuter I was hoping to see BroadcastHashJoin

I tried other options & observed that:

b.join(Broadcast(a), $"a.id" === $"b.id", "left") 

gives me BroadcastHashJoin [id#163], [id#187], LeftOuter, BuildRight This was just an experiment. I can't use this because I need "a left join b".

I tried b.join(Broadcast(a), $"a.id" === $"b.id", "right") but this again gives me SortMergeJoin.

Memory is not an issue. I don't see any spills. The driver has 16GB too.

Any idea, why Spark might not Broadcast in spite of smaller dataset and an explicit hint?

airsquared
  • 571
  • 1
  • 8
  • 25
ri2
  • 124
  • 1
  • 4
  • 2
    It is more likely that your small dataframe is larger than the setting `spark.broadcast.blockSize` which is 10M by default. First make an estimation of the size of the dataframe you want to broadcast then try to change the broadcast size controlled by `spark.broadcast.blockSize` as described here https://stackoverflow.com/questions/41045917/what-is-the-maximum-size-for-a-broadcast-object-in-spark – abiratsis May 10 '19 at 10:15

0 Answers0