
I have a problem similar to the one described in this SO post:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of [N] tasks is bigger than spark.driver.maxResultSize (1024.0 MB)

This worked previously on Glue 2.0 with Spark 2.4, and broke when I moved to Glue 3.0 with Spark 3.1.

Since I am not doing an explicit broadcast, I believe this is because Spark 3.1 is automatically converting the join to a broadcast join at runtime.

My question is: how do I track down and fix Spark's misjudgement of size or statistics? The join is between two data frames, each of which is itself the result of joins or aggregations of other data frames, so the inputs are several layers removed from the physical files in S3.
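
For reference, a minimal way to check whether the plan contains a broadcast (assuming a PySpark session named spark, with joined_df as a stand-in for the final dataframe) would be something like:

# Sketch only: joined_df is a placeholder for the dataframe whose join fails.
# Look for BroadcastExchange / BroadcastHashJoin nodes in the physical plan.
joined_df.explain(mode="formatted")

# Settings that drive the automatic broadcast decision.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
print(spark.conf.get("spark.sql.adaptive.enabled"))

Note that with AQE enabled, the plan printed before execution is only the initial one (AdaptiveSparkPlan with isFinalPlan=false); the re-optimized plan that actually ran is only visible in the Spark UI's SQL tab after execution.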

wrschneider
  • Updated the question based on the comment about the adaptive framework, so my question now comes down to: why did it choose broadcast in this specific case? – wrschneider Jan 13 '22 at 13:15
  • ...and the question became totally different. This is called moving the goalposts. – mazaneicha Jan 13 '22 at 14:25

1 Answer


You can disable Spark's automatic broadcast joins by setting spark.sql.autoBroadcastJoinThreshold to -1 in the Spark config:

spark = SparkSession.builder.config("spark.sql.autoBroadcastJoinThreshold", -1).getOrCreate()
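
A slightly fuller sketch, assuming a PySpark job (the app name is hypothetical, and the spark.sql.adaptive.enabled line is only a temporary debugging step to rule out AQE's runtime conversion of sort-merge joins into broadcast joins, not a recommended permanent setting):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("glue-broadcast-debug")  # hypothetical name
    # Disable size-based automatic broadcast joins entirely.
    .config("spark.sql.autoBroadcastJoinThreshold", -1)
    # Optional, debugging only: disable AQE to rule out its runtime join conversion.
    .config("spark.sql.adaptive.enabled", "false")
    .getOrCreate()
)

# On an already-running session, the same setting can be applied at runtime:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)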

  • 1
    Yes I know, the question is how do you know what is being broadcast and why? – wrschneider Feb 24 '23 at 14:56
  • In the Spark UI, you can see what is being broadcast under the SQL / DataFrame tab. It shows a graphical plan and which DataFrame is being broadcast. As for why: most likely the threshold is set somewhere and you have to explicitly unset it via the Spark config. – Satyaki Mallick Feb 25 '23 at 11:06