
I am trying to use SparkSQL to export my database to S3 in Parquet format.

One of my tables contains rows larger than 2 GB. The Spark job was submitted with --conf spark.executor.memory=21g --conf spark.executor.memoryOverhead=9g --conf spark.executor.cores=8.
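For context, a minimal sketch of the kind of export I am running; the table and bucket names below are placeholders, not the real ones:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("export-to-s3").getOrCreate()

    # Read the source table and write it out as Parquet to S3.
    # "my_database.my_table" and the s3a:// path are placeholder names.
    df = spark.table("my_database.my_table")
    df.write.mode("overwrite").parquet("s3a://my-bucket/export/my_table/")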

It seems there may be a limitation in Spark (Maximum size of rows in Spark jobs using Avro/Parquet), but I'm not sure whether that is the cause here.

Is there a workaround for that?

supang
  • Have you checked the number of partitions for the dataframe? Sometimes increasing the number of partitions could resolve the issue. – Nikunj Kakadiya Aug 04 '21 at 13:32
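A minimal sketch of the repartitioning suggested in the comment above; the partition count of 400 is an arbitrary starting value to tune, and the path is a placeholder:

    # More partitions means each write task handles less data.
    # 400 is an example value, not a recommendation.
    df.repartition(400).write.mode("overwrite").parquet("s3a://my-bucket/export/my_table/")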

1 Answer


The default value of spark.driver.maxResultSize is 1g. You might need to set it higher if you encounter this issue:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized
results of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB)
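A sketch of how the limit could be raised, either at submit time or when the session is built; the 4g value is only an example and should be sized to your data:

    # Option 1: pass it at submit time (example value).
    # spark-submit --conf spark.driver.maxResultSize=4g ...

    # Option 2: set it before the SparkSession is created.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("export-to-s3")
        .config("spark.driver.maxResultSize", "4g")  # driver-side limit; must be set at session start
        .getOrCreate()
    )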

Reference

chehsunliu