When running a Spark Streaming app that consumes data from a Kafka topic with 100 partitions, using 10 executors with 5 cores and 20 GB RAM each, the executors crash with the following log:
ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred.
ERROR YarnClusterScheduler: Lost executor 18 on worker23.oct.com: Slave lost
ERROR ApplicationMaster: RECEIVED SIGNAL TERM
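For context, the app is a fairly standard direct Kafka stream. The sketch below is only a minimal approximation of that setup, not the actual application code: the broker list, topic name ("events"), consumer group id, batch interval, and the println processing are all assumptions made for illustration; the resource settings in the comment mirror the configuration described above.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamingApp {
  def main(args: Array[String]): Unit = {
    // Submitted roughly as (assumed):
    //   spark-submit --master yarn --deploy-mode cluster \
    //     --num-executors 10 --executor-cores 5 --executor-memory 20g ...
    val conf = new SparkConf().setAppName("kafka-streaming-app")
    val ssc = new StreamingContext(conf, Seconds(10)) // hypothetical batch interval

    // Hypothetical broker list and consumer settings; the real topic has 100 partitions.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092,broker2:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "streaming-app-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream from the spark-streaming-kafka-0-10 integration.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Placeholder processing: the real logic is app-specific,
    // and notably it never calls cache() or persist().
    stream.foreachRDD { rdd =>
      rdd.foreach(record => println(record.value()))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```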
This exception appears in the Spark JIRA:
https://issues.apache.org/jira/browse/SPARK-17380
One commenter there wrote that the problem was solved after upgrading to Spark 2.0.2. However, we use Spark 2.1 as part of HDP 2.6, so I guess this bug wasn't fixed in Spark 2.1.
Someone else who encountered this bug wrote about it on the Spark user list but got no answer.
BTW, the streaming app doesn't call cache() or persist(), so no caching is involved whatsoever.
Has anyone encountered a streaming app that crashed on this bug?