
I have a Spark job that throws "java.lang.OutOfMemoryError: GC overhead limit exceeded".

The job is trying to process a file of about 4.5 GB.

I've tried the following Spark configuration:

--num-executors 6  --executor-memory 6G --executor-cores 6 --driver-memory 3G 

I tried increasing the number of cores and executors, which sometimes works, but it then takes over 20 minutes to process the file.

Could I do something to improve the performance, or stop the Java heap issue?
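For reference, those flags correspond roughly to the Spark configuration keys below (just a sketch; the application name is a placeholder, and in practice everything is passed via spark-submit):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Rough SparkConf equivalent of the spark-submit flags above.
// The app name is a placeholder; note that spark.driver.memory only takes
// effect if it is set before the driver JVM starts.
val conf = new SparkConf()
  .setAppName("file-processing-job")      // placeholder name
  .set("spark.executor.instances", "6")   // --num-executors 6
  .set("spark.executor.memory", "6g")     // --executor-memory 6G
  .set("spark.executor.cores", "6")       // --executor-cores 6
  .set("spark.driver.memory", "3g")       // --driver-memory 3G
val sc = new SparkContext(conf)
```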

diplomaticguru
  • Have you tried allocating more heap size at runtime? – Mark Jun 15 '15 at 19:08
  • Identify which operation is causing the OOME and try to do it differently. Post on SO for help. – Jean Logeart Jun 15 '15 at 19:11
  • GC overhead limit exceeded means that the JVM is not able to reclaim any considerable amount of memory after a GC pause. This indicates some kind of memory leak. You may have luck tuning the heap-size parameter `spark.executor.memory`; I do not think it is really getting set by your --executor-memory parameter. Take a look at this SO post as well: http://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory – ring bearer Jun 15 '15 at 19:19
  • Are you caching the RDDs? – Vijay Innamuri Jun 16 '15 at 04:44
  • @Mark - tried that, but the problem does show up now and again. – diplomaticguru Jun 16 '15 at 10:36
  • @ringbearer, tried that but the result is the same. – diplomaticguru Jun 16 '15 at 10:37
  • @VijayInnamuri, yes I'm caching. Initially I cached it in memory, but later persisted it with MEMORY_AND_DISK. I noticed that stages were failing due to lost executors, so they were being recomputed, which degraded the performance. – diplomaticguru Jun 16 '15 at 10:39
  • Does your cluster have enough memory to process this dataset? – Vijay Innamuri Jun 16 '15 at 11:27
  • Make sure that `spark.memory.fraction=0.6`. If it is higher than that, you run into garbage collection errors; see https://stackoverflow.com/a/47283211/179014. (A sketch of both configuration suggestions follows these comments.) – asmaier Nov 14 '17 at 10:24
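Putting the two configuration suggestions from the comments in one place (a sketch only; the values are the ones quoted above, and `spark.memory.fraction` only exists from Spark 1.6 onwards):

```scala
// Sketch of the settings mentioned in the comments (values as quoted above).
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "6g")    // set the executor heap explicitly
  .set("spark.memory.fraction", "0.6")   // keep at (or below) the 0.6 default to avoid GC pressure
```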

2 Answers


The only solution is to fine-tune the configuration.

From my experience, the following points help with OOM (a short sketch putting them together follows the list):

  • cache an RDD only if you are going to use it more than once

If you still need to cache, then analyze the data and the application with respect to the available resources.

  • If your cluster has enough memory, then increase spark.executor.memory to its maximum
  • Increase the number of partitions to increase the parallelism
  • Increase the memory dedicated to caching via spark.storage.memoryFraction. If a lot of shuffle memory is involved, try to avoid that or split the allocation carefully
  • Spark's caching feature persist(MEMORY_AND_DISK) comes at the cost of additional processing (serializing, writing and reading back the data); CPU usage will usually be quite high in this case
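A rough sketch putting these points together (assuming an existing SparkContext `sc`; the input path, partition count and transformation are only placeholders):

```scala
import org.apache.spark.storage.StorageLevel

// More partitions -> smaller tasks and more parallelism (200 is a placeholder).
val lines = sc.textFile("hdfs:///path/to/input").repartition(200)

// Persist only because the RDD is reused by two actions below; MEMORY_AND_DISK
// trades extra CPU (serialization, disk I/O) for not recomputing lost partitions.
val records = lines.map(_.split("\t")).persist(StorageLevel.MEMORY_AND_DISK)

val total  = records.count()    // first action
val sample = records.take(10)   // second action reuses the persisted data
```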
Vijay Innamuri
  1. You can try increasing the driver-memory. If you don't have enough memory overall, maybe you can take it from the executor-memory.

  2. Check the Spark UI (available on port 4040) to see what the scheduler delay is. If the scheduler delay is high, quite often the driver is shipping a large amount of data to the executors, which needs to be fixed, for example by broadcasting the data (see the sketch below).
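A minimal sketch of that fix using a broadcast variable (assuming an existing SparkContext `sc`; the lookup map and RDD are hypothetical):

```scala
// Broadcast large read-only data once per executor instead of shipping it
// inside every task closure. The lookup map and RDD below are hypothetical.
val lookup = Map("a" -> 1, "b" -> 2)                        // imagine this map is large
val lookupBc = sc.broadcast(lookup)

val keys   = sc.parallelize(Seq("a", "b", "c"))
val values = keys.map(k => lookupBc.value.getOrElse(k, 0))  // tasks read the broadcast copy
println(values.collect().toSeq)
```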

SanS
  • I already tried increasing the driver-memory, but no joy. There is no scheduler delay; the job starts running within 5-10 seconds. – diplomaticguru Jun 16 '15 at 10:41