
I'm running code in Apache Spark on Azure that converts over 3 million XML files into one CSV file. I get the following error when I do this:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1408098 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)

I know what the error means in general, but I don't know what it means in my case and I don't understand how to solve this.

The code is:

All XML files are loaded:

df = spark.read.format('com.databricks.spark.xml').option("rowTag", "ns0:TicketScan").load('LOCATION/*.xml')

All loaded files are written into one CSV file:

def saveDfToCsv(df, tsvOutput):
    # write the DataFrame as a single CSV part file into a temporary directory
    tmpParquetDir = "dbfs:/tmp/mart1.tmp.csv"
    dbutils.fs.rm(tmpParquetDir, True)
    df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(tmpParquetDir)
    # locate the single part file and move it to the requested output path
    src = [f for f in dbutils.fs.ls(tmpParquetDir) if "part-00000" in f.name][0].path
    dbutils.fs.mv(src, tsvOutput)

saveDfToCsv(df, 'LOCATION/database.csv')

I hope my question is clear enough. If not, please allow me to explain it further.

I hope someone can help me.

Best regards.

Ganesh Gebhard
  • Looks like you are sending a lot of data to the driver. By setting spark.driver.maxResultSize to 0 you can avoid this error, but fixing your program so it does not send such large amounts of data to the driver is advisable. https://spark.apache.org/docs/latest/configuration.html – morfious902002 Oct 30 '18 at 20:54
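On a job where you create the SparkSession yourself, the property from the linked configuration page has to be set before the context starts; a minimal sketch with an arbitrary 8g value (not taken from the question):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("xml-to-csv")
         .config("spark.driver.maxResultSize", "8g")  # 0 disables the limit, which is risky
         .getOrCreate())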

3 Answers


You need to change this parameter in the cluster configuration. Go into the cluster settings, under Advanced Options select Spark, and paste spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you. Using 0 is not recommended; you should instead optimize the job by repartitioning.
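For illustration, the entry pasted into that Spark config box is a space-separated key and value; 8g here is only an example value:

spark.driver.maxResultSize 8g

A sketch of the repartitioning idea, reusing the DataFrame from the question (the partition count is arbitrary; writing with more partitions avoids funnelling the whole result through one task):

# write several part files instead of forcing everything through a single partition
df.coalesce(32).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/tmp/mart1.tmp.csv")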

Salman Ghauri

It looks like your driver has a limited size for storing results, and your result has crossed that limit. You can check and then increase this limit with the following commands in your notebook.

sqlContext.getConf("spark.driver.maxResultSize")
res19: String = 20g

This returns the current maximum result size; on my cluster it is 20 GB.

sqlContext.setConf("spark.driver.maxResultSize","30g")

To increase maxResultSize, use the command above.
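Those commands are Scala; in a Python notebook (the question's code is Python) the equivalent would presumably be:

spark.conf.get("spark.driver.maxResultSize")
spark.conf.set("spark.driver.maxResultSize", "30g")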

Increasing this limit is not recommended, because it reduces the performance of your cluster: you minimize the free space allocated to temporary files for processing on the cluster. But I think it will solve your issue.

Vijay Kumar Sharma
  • Hi Vijay, thanks for the reply. When I set maxResultSize to 10g and check whether the value changed, it works. However, when I run the code again (after changing it to 10g) I still get the same error, which still says that the maxResultSize of 4.0 GB is exceeded. So somehow this value didn't change. Do you know what is wrong here? – Ganesh Gebhard Nov 01 '18 at 16:08
  • Did we find any solution here? I'm facing a similar issue. – prady Apr 05 '19 at 02:20
  • I had the same experience as Ganesh; it worked when I set it in the cluster config. It also makes sense for this to be a parameter set at cluster spin-up, with the caveat that you may accidentally strain the driver's heap. Check the discussion here: https://stackoverflow.com/questions/39087859/what-is-spark-driver-maxresultsize – yousraHazem Sep 15 '21 at 20:26

You need to increase the maxResultSize value for the cluster.

The maxResultSize must be set BEFORE the cluster is started -- trying to set the maxResultSize in the notebook after the cluster is started will not work.

"Edit" the cluster and set the value in the "Spark Config" section under "Advanced Options".

Here is a screenshot of Configure Cluster for Databricks in AWS, but something similar probably exists for Databricks in Azure.

[screenshot: cluster configuration]

In your notebook you can verify that the value is already set by including the following command:

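The screenshot of that command has not survived; a likely Python equivalent (my reconstruction, not the original image) is:

spark.conf.get("spark.driver.maxResultSize")
# should echo the value configured on the cluster, e.g. '8g'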

Of course 8g may not be large enough in your case, so keep increasing it until the problem goes away -- or something else blows up! Best of luck.

Note: When I ran into this problem my notebook was attempting to write to S3, not directly trying to "collect" the data, so-to-speak.

warrens