
I have a 100% reproducible OutOfMemoryError (most often reported as "GC overhead limit exceeded") when running my Spark application. It happens at approximately the 700th stage.

As the error stack always includes classes like .ui. and TaskSchedulerImpl, I've concluded that the problem lies not in the executors, but in the driver process itself. This conclusion is backed up by the following observation: minutes before the OOM, the stdout output starts pausing for a second or so at a time, then printing a burst of lines immediately after each pause.

spark.driver.memory is configured to 10G, but the debugging tools show that only 1 GB is actually used by the driver:

  1. I've followed these great instructions on collecting GC statistics and analyzed them with the gceasy.io service (the JVM options involved are sketched after this list); it clearly showed that:
    • The maximum heap usage after GC is approximately 1 GB.
    • Close to the moment of the OOM, the 'heap usage' graph almost touches the 1 GB maximum, and the numerous GC events fail to bring it down. GC overhead limit exceeded at its best.
  2. I've used MAT (Eclipse Memory Analyzer) to analyse the heap dump created immediately after the OutOfMemoryError.
    • The heap dump contains approximately the same 1 GB of data.
    • The dominator tree shows that more than half of that is consumed by UI objects.
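
For reference, the GC statistics fed to gceasy.io and the heap dump opened in MAT can both be produced by passing JVM options to the driver at submit time. Below is a minimal sketch, assuming client mode, a Java 8 JVM, and placeholder paths, class and jar names:

    # Sketch: GC logging for gceasy.io plus a heap dump on OOM for MAT.
    # Paths, class and jar names are placeholders; the flags assume a Java 8 JVM.
    spark-submit \
      --deploy-mode client \
      --driver-java-options "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/driver-gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver-heap.hprof" \
      --class com.example.MyApp \
      my-app.jar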

This question's answer suggests that the remaining 10 GB - 1 GB = 9 GB may be used by JNI libraries; but apparently Spark does not use JNI at its core, and neither do I.

I've used this question's answer to minimize the amount of UI data retained. As a result, my application ran successfully. But I am not ready to part with all the precious debugging data that can be explored through the Spark UI.
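
For context, the retention settings in question are Spark's spark.ui.retained* properties; lowering them caps how much job, stage and task metadata the driver keeps for the UI. A sketch of such a trimmed configuration (the values are only illustrative, and the class and jar names are placeholders):

    # Sketch: lower the amount of UI bookkeeping the driver retains.
    # Values are illustrative, not recommendations.
    spark-submit \
      --conf spark.ui.retainedJobs=200 \
      --conf spark.ui.retainedStages=200 \
      --conf spark.ui.retainedTasks=20000 \
      --conf spark.sql.ui.retainedExecutions=50 \
      --class com.example.MyApp \
      my-app.jar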

Also, I was not able to find any explanation of Spark's driver memory model.

The question is: how do I retain the UI debugging data without running into OOMs on the driver?


1 Answer

The actual problem was that the driver process was running with only about 1 GB of heap, despite the setting spark.driver.memory=10G.

According to the documentation: "In client mode, this config (spark.driver.memory) must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file."

I was using client mode. Moving the setting from a Spark context parameter to a spark-submit command line parameter solved the issue.
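
For illustration, a minimal client-mode invocation with the driver memory passed where it actually takes effect (class and jar names are placeholders; the same value could instead go into spark-defaults.conf as spark.driver.memory 10g):

    # Effective: the driver JVM is launched with a 10g heap.
    # Setting spark.driver.memory in SparkConf inside the application is
    # ignored here, because in client mode the driver JVM is already running.
    spark-submit \
      --deploy-mode client \
      --driver-memory 10g \
      --class com.example.MyApp \
      my-app.jar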

P.S. "If nothing works as expected, read the manual" (c).
