I'm running code in Apache Spark on Azure that converts over 3 million XML files into one CSV file. I get the following error when I run it:
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1408098 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
I know what the error means in general, but I don't understand what it means in my case or how I can solve it.
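For reference, the limit in the error is the spark.driver.maxResultSize setting, which, as far as I understand, has to be set before the Spark context is created, so on Databricks it would go into the cluster's Spark config rather than notebook code. Just as a sketch of what I mean (the 8g value is only an example, not something I have applied):

from pyspark.sql import SparkSession

# Sketch only: raise the driver result-size limit when building a session.
# In a Databricks notebook the session already exists, so this setting
# would instead be placed in the cluster's Spark config.
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "8g")  # example value
    .getOrCreate()
)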
The code is:
All XML files are loaded:
df = spark.read.format('com.databricks.spark.xml').option("rowTag", "ns0:TicketScan").load('LOCATION/*.xml')
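I suspect the very large task count (1,408,098) comes from each small XML file ending up in its own partition. A quick check like the following (just a sketch, I haven't added it to the job yet) would show how many partitions the loaded DataFrame has:

# Sketch: check how many partitions the loaded DataFrame has.
# With millions of small XML files I expect this number to be huge.
print(df.rdd.getNumPartitions())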
Then the loaded data is written to a single CSV file:
def saveDfToCsv(df, tsvOutput):
    tmpParquetDir = "dbfs:/tmp/mart1.tmp.csv"
    dbutils.fs.rm(tmpParquetDir, True)
    # Write everything into a single partition so only one part file is produced
    df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(tmpParquetDir)
    # Pick up the single part file and move it to the final location
    src = [f.path for f in dbutils.fs.ls(tmpParquetDir) if "part-00000" in f.name][0]
    dbutils.fs.mv(src, tsvOutput)
saveDfToCsv(df, 'LOCATION/database.csv')
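One alternative I have considered (but not yet tried) is to skip the repartition(1) step and let Spark write multiple part files instead; saveDfToCsvParts below is just a hypothetical name for illustration:

def saveDfToCsvParts(df, outputDir):
    # Sketch only: write the CSV as multiple part files instead of
    # forcing everything into one partition with repartition(1).
    df.write.format("com.databricks.spark.csv") \
        .option("header", "true") \
        .save(outputDir)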
I hope my question is clear enough; if not, please let me know and I will explain further.
I hope someone can help me.
Best regards.