
I am using a Jupyter notebook on EMR to handle large chunks of data. While processing the data I see this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 108 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

It seems I need to increase spark.driver.maxResultSize in the Spark config. How do I set it from the Jupyter notebook?

I have already checked this post: Spark 1.4 increase maxResultSize memory

Also, in an EMR notebook the Spark context is already provided. Is there any way to edit the Spark context and increase maxResultSize?

Any leads would be very helpful.

Thanks

Amit Kumar
  • Here is the answer to your question: https://stackoverflow.com/questions/31058504/spark-1-4-increase-maxresultsize-memory – ZINE Mahmoud May 11 '20 at 11:05
  • I already tried that, but once I stop the Spark context, I see this error: An error was encountered: Invalid status code '400' from https://** with error payload: {"msg":"requirement failed: Session isn't active."} – Amit Kumar May 11 '20 at 11:09
  • Facing this problem too, can anyone help? – deedeeck28 May 20 '20 at 17:11

1 Answer


You can set the Livy configuration at the start of the Spark session; see https://github.com/cloudera/livy#request-body

Place this at the start of your code:

%%configure -f
{"conf":{"spark.driver.maxResultSize":"15G"}}

Check the settings of your session by printing them in the next cell:

print(spark.conf.get('spark.driver.maxResultSize'))

This should resolve the problem.
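If you also need to adjust other session settings, a single %%configure cell can carry them together. This is only an illustrative sketch: the extra fields (executorMemory and spark.sql.shuffle.partitions) and their values are assumptions, not something required to fix this particular error.

%%configure -f
{"executorMemory": "4G",
 "conf": {"spark.driver.maxResultSize": "15G",
          "spark.sql.shuffle.partitions": "400"}}

Keep in mind that the -f flag drops and recreates the current Livy session, so run this cell before anything else in the notebook.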

deedeeck28
  • Yes, using this command we can update maxResultSize. If we want to update any other parameter, we can update it the same way. To check the updated config, use the %%info command. – Amit Kumar May 21 '20 at 13:11
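As the last comment notes, you can also read the properties back from the running session. A minimal sketch, assuming the key list is just an example to be replaced by whatever you configured:

# Print the current value of each property of interest; "not set" is the fallback default
for key in ("spark.driver.maxResultSize", "spark.sql.shuffle.partitions"):
    print(key, "=", spark.conf.get(key, "not set"))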