
I'm getting a "GC overhead limit exceeded" error on Spark 1.5.2 (reproducible every ~20 hours). There is no memory leak in MY code. Could it be Spark's fault? Spark 1.6.0 changed the memory management; will that fix this problem?

2016-09-05 19:40:56,714 WARN TaskSetManager: Lost task 11.0 in stage 13155.0 (TID 47982, datanode004.current.rec.mapreduce.m1.p.fti.net): java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
    at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
    at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:203)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:202)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:202)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

2016-09-05 19:40:56,725 WARN TaskSetManager: Lost task 7.0 in stage 13155.0 (TID 47978, datanode004.current.rec.mapreduce.m1.p.fti.net): java.io.FileNotFoundException: /var/opt/hosting/data/disk1/hadoop/yarn/usercache/nlevert/appcache/application_1472802379984_2249/blockmgr-f71761be-e12b-4bbc-bf38-9e6f7ddbb3a2/14/shuffle_2171_7_0.data (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:177)
    at org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(WritablePartitionedPairCollection.scala:55)
    at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:681)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

[Screenshot: memory consumption]

  • You could have a memory leak in a library you are using; I would monitor your memory consumption to see whether more memory is retained after a full GC. – Peter Lawrey Sep 06 '16 at 08:28
  • I did a memory dump just before the GC crash (see the screenshot of the memory consumption). I have many String objects retained in memory (contained in RDDs). I'm sure I don't have any leak in my code, as I don't keep anything in memory myself. I just use a Spark window (60 seconds)... I suppose Spark should delete old/useless RDDs itself, right? – user2459075 Sep 06 '16 at 12:35
  • The thread dump (which is not the same as a memory dump) indicates you ran out of memory while performing an operation on your cache. I expect the cache size is too large for the amount of memory you have. Set your max memory high enough and you might find it doesn't grow larger, or you could try reducing the size of your cache. – Peter Lawrey Sep 06 '16 at 12:38
  • What do you mean by "reduce the size of your cache"? – user2459075 Sep 06 '16 at 13:51
  • **reduce** - *make smaller or less in amount, degree, or size.* –  Sep 06 '16 at 15:17
  • **cache** - *a collection of items of the same type stored in a hidden or inaccessible place.* –  Sep 06 '16 at 15:17
  • In your stack trace you can see the crash is in an operation where it is trying to determine the size of a cache in the CacheManager. – Peter Lawrey Sep 06 '16 at 15:32
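
To illustrate the "reduce the size of your cache" suggestion in the comments above, here is a minimal Scala sketch (the input path and pipeline are hypothetical, not taken from the question): cache only while the data is actually reused, then release the blocks explicitly.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cache-hygiene"))

    // Hypothetical pipeline: cache only while the data is actually reused.
    val parsed = sc.textFile("hdfs:///some/input").map(_.split(',')).cache()
    val total  = parsed.count()        // first action fills the cache
    val sample = parsed.take(10)       // second action reuses the cached blocks

    // Release the cached blocks as soon as they are no longer needed,
    // so the MemoryStore (and SizeEstimator) has less to track.
    parsed.unpersist(blocking = true)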

2 Answers


In similar cases I've faced, increasing the memory solved the issue. Try the following:

For either spark-submit or spark-shell, add the following arguments:

--executor-memory 6G to set the memory of the executors (workers)
--driver-memory 6G to set the driver's memory

In your case, the first one will probably help.
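
For illustration, this is roughly what that could look like on the command line (the master, class name, jar, and memory sizes below are placeholders, not taken from the question):

    # Hypothetical invocation; adjust master, class, jar and sizes to your job
    spark-submit \
      --master yarn \
      --driver-memory 6G \
      --executor-memory 6G \
      --class com.example.StreamingJob \
      streaming-job.jar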

IrishDog
  • Thank you Irish, I already increased the memory (from 2g to 4g), but there is still a leak as it will crash later. – user2459075 Sep 06 '16 at 12:31

The other possibility is that the data Spark keeps in memory is being held in plain, deserialized form. Try using serialization and compression for the data that Spark maintains in memory. Try this:

    val conf = new SparkConf().setAppName("Test-App")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
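
As a hedged follow-up (not from the original answer): these settings affect shuffle data and serialized blocks, but cached RDD data is only stored serialized if a serialized storage level is used as well. A minimal sketch, with a hypothetical RDD and input path:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("Test-App")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
    val sc = new SparkContext(conf)

    // Hypothetical RDD; MEMORY_ONLY_SER keeps the cached blocks as compact
    // serialized byte arrays instead of deserialized Java objects.
    val events = sc.textFile("hdfs:///some/input")
    events.persist(StorageLevel.MEMORY_ONLY_SER)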

BalaramRaju