
I am using a Jupyter notebook on EMR to handle large chunks of data. While processing the data I see this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 108 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

It seems I need to increase spark.driver.maxResultSize in the Spark config. How do I set it from the Jupyter notebook?

I have already checked this post: Spark 1.4 increase maxResultSize memory

Also, in an EMR notebook the Spark context is already provided. Is there any way to edit the Spark context and increase maxResultSize?

Any leads would be very helpful.

Thanks

Amit Kumar
  • Here is the answer to your question: https://stackoverflow.com/questions/31058504/spark-1-4-increase-maxresultsize-memory – ZINE Mahmoud May 11 '20 at 11:05
  • I already tried that, but once I stop the Spark context, I see this error: An error was encountered: Invalid status code '400' from https://** with error payload: {"msg":"requirement failed: Session isn't active."} – Amit Kumar May 11 '20 at 11:09
  • Facing this problem too, can anyone help? – deedeeck28 May 20 '20 at 17:11

1 Answer


You can set the Livy configuration at the start of the Spark session; see https://github.com/cloudera/livy#request-body

Place this at the start of your code:

%%configure -f
{"conf":{"spark.driver.maxResultSize":"15G"}}

Check the settings of your session by printing them in the next cell:

print(spark.conf.get('spark.driver.maxResultSize'))

This should resolve the problem.
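If you also need to adjust other session settings, a single %%configure cell can carry them together. This is only an illustrative sketch: the extra fields (executorMemory and spark.sql.shuffle.partitions) and their values are assumptions, not something required to fix this particular error.

%%configure -f
{"executorMemory": "4G",
 "conf": {"spark.driver.maxResultSize": "15G",
          "spark.sql.shuffle.partitions": "400"}}

Keep in mind that the -f flag drops and recreates the current Livy session, so run this cell before anything else in the notebook.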

deedeeck28
  • Yes, using this command we can update maxResultSize. If we want to update any other parameter, we can update it the same way. To check the updated config, use the %%info command. – Amit Kumar May 21 '20 at 13:11
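As the last comment notes, you can also read the properties back from the running session. A minimal sketch, assuming the key list is just an example to be replaced by whatever you configured:

# Print the current value of each property of interest; "not set" is the fallback default
for key in ("spark.driver.maxResultSize", "spark.sql.shuffle.partitions"):
    print(key, "=", spark.conf.get(key, "not set"))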