I am trying to do a broadcast join on two tables. The size of the smaller table varies based on the parameters, but the size of the larger table is close to 2 TB.
What I have noticed is that unless I set spark.sql.autoBroadcastJoinThreshold to 10 GB, some of these operations do a SortMergeJoin instead of a broadcast join. But the smaller table shouldn't be anywhere near that big: when I wrote it out to an S3 folder, it took only 12.6 MB of space.
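For reference, this is roughly how I set things up; the table names, paths, and join key below are placeholders, not my actual ones:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Raise the auto-broadcast threshold to 10 GB (the value is in bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024 * 1024)

// Placeholder paths: the small table is ~12.6 MB on S3, the large one ~2 TB.
val small = spark.read.parquet("s3://bucket/small_table")
val large = spark.read.parquet("s3://bucket/large_table")

// Without the raised threshold, explain() shows SortMergeJoin instead of
// BroadcastHashJoin for some parameter combinations.
large.join(small, Seq("join_key")).explain()
```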
I ran some operations on the smaller table so that its shuffle size would show up in the Spark History Server, and its size in memory appeared to be about 150 MB, nowhere near 10 GB. However, when I force a broadcast join on the smaller table, it takes a long time to broadcast, which makes me suspect the table is effectively larger than 150 MB.
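Forcing the broadcast, which is the variant that takes a long time, looks like this (continuing the sketch above, with the same placeholder names):

```scala
import org.apache.spark.sql.functions.broadcast

// Force a broadcast of the small side regardless of the threshold.
// This is the join that takes surprisingly long to broadcast.
val joined = large.join(broadcast(small), Seq("join_key"))
joined.explain()  // should now show BroadcastHashJoin in the physical plan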
What would be a good way to figure out the actual size that Spark sees for this table when deciding whether it crosses the value defined by spark.sql.autoBroadcastJoinThreshold?
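For context, the closest thing I have found is reading the optimizer's own statistics off the plan (assuming Spark 2.4+; I'm not certain this is the same number the planner actually compares against the threshold):

```scala
// The size estimate attached to the optimized logical plan.
val estimatedBytes = small.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Optimizer size estimate: $estimatedBytes bytes")
```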