
I am trying to use SparkSQL to export my database to S3 in Parquet format.

One of my tables contains rows larger than 2 GB. The Spark job was submitted with --conf spark.executor.memory=21g --conf spark.executor.memoryOverhead=9g --conf spark.executor.cores=8.
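For context, a minimal sketch of the kind of export I am running; the table and bucket names below are placeholders, not the real ones:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("export-to-s3").getOrCreate()

    # Read the source table and write it out as Parquet to S3.
    # "my_database.my_table" and the s3a:// path are placeholder names.
    df = spark.table("my_database.my_table")
    df.write.mode("overwrite").parquet("s3a://my-bucket/export/my_table/")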

It seems there may be a limitation in Spark (Maximum size of rows in Spark jobs using Avro/Parquet), but I'm not sure whether that is the cause here.

Is there a workaround for that?

supang
  • Have you checked the number of partitions for the dataframe? Sometimes increasing the number of partitions could resolve the issue. – Nikunj Kakadiya Aug 04 '21 at 13:32
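A minimal sketch of the repartitioning suggested in the comment above; the partition count of 400 is an arbitrary starting value to tune, and the path is a placeholder:

    # More partitions means each write task handles less data.
    # 400 is an example value, not a recommendation.
    df.repartition(400).write.mode("overwrite").parquet("s3a://my-bucket/export/my_table/")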

1 Answer


The default value of spark.driver.maxResultSize is 1g. You might need to set it higher if you encounter this issue:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized
results of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB)
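A sketch of how the limit could be raised, either at submit time or when the session is built; the 4g value is only an example and should be sized to your data:

    # Option 1: pass it at submit time (example value).
    # spark-submit --conf spark.driver.maxResultSize=4g ...

    # Option 2: set it before the SparkSession is created.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("export-to-s3")
        .config("spark.driver.maxResultSize", "4g")  # driver-side limit; must be set at session start
        .getOrCreate()
    )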

Reference

chehsunliu