
I am entering the world of Spark. I use Spark v1.5.0 (cdh5.5.1). I run a Spark job that reads .gz files whose sizes range from 30MB to 150MB, with a compression factor of ~20. I process around 300 of these files in 50 executors, using YARN in yarn-client mode. The job first reads the data from the files and transforms it into an RDD[List[String]] (a simple Spark map). Thanks to this SO question, I figured out that my job was failing because "someone" was killing my executors, but it was not trivial to find out who, as the only error I was getting from the logs (the merged stdout and stderr of all containers, obtained with the yarn logs command) was:

16/02/12 08:28:00 INFO rdd.NewHadoopRDD: Input split: hdfs://xxx:8020/user/mjost/file001.gz:0+39683663
16/02/12 08:28:38 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
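
For context, the read/transform part of the job looks roughly like the sketch below (the path pattern, the tab delimiter and the record shape are placeholders, not my actual code):

    // Rough sketch of the job described above; path, delimiter and app name
    // are placeholders.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object GzToRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("gz-to-rdd"))

        // .gz files are not splittable, so each file becomes a single partition:
        // a 150MB .gz with a ~20x compression factor inflates to roughly 3GB
        // handled by one task.
        val lines: RDD[String] = sc.textFile("hdfs://xxx:8020/user/mjost/*.gz")

        // The "simple Spark map": split each line into a List[String].
        val records: RDD[List[String]] = lines.map(_.split('\t').toList)

        println(records.count())
        sc.stop()
      }
    }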

I suspect YARN kills them because they take more memory than reserved. I managed to fix the issue by increasing spark.yarn.executor.memoryOverhead (see the sketch after the question below), but I would like to understand why YARN kills them so I can handle the situation better. My question:

  • Where can I get more precise information about why the executors were killed?
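
For reference, this is roughly how I raised the overhead; the heap size and the 2048MB value are placeholders, not my real settings, and the same option can be passed on spark-submit as --conf spark.yarn.executor.memoryOverhead=2048:

    // Minimal sketch of the fix that worked for me (placeholder values).
    import org.apache.spark.{SparkConf, SparkContext}

    object GzToRddWithOverhead {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("gz-to-rdd")
          .set("spark.executor.memory", "4g")
          // Off-heap headroom YARN grants on top of the executor heap; in
          // Spark 1.5 the default is max(384MB, 10% of spark.executor.memory).
          // YARN's container monitor kills the container (SIGTERM, then
          // SIGKILL) when its physical memory exceeds heap + overhead.
          .set("spark.yarn.executor.memoryOverhead", "2048")
        val sc = new SparkContext(conf)
        // ... same job as sketched above ...
        sc.stop()
      }
    }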
  • Did you enable speculative execution? – Ravindra babu Feb 12 '16 at 10:35
  • Yes, I have it enabled (spark.speculation=true, spark.speculation.interval=5000ms, spark.speculation.multiplier=1.8, spark.speculation.quantile=0.5). You think it could be the problem? – mauriciojost Feb 12 '16 at 10:48
  • Long-running tasks will be retried because of that setting. Whichever attempt finishes first will cause the other task to die. If you are getting the required output from the application, you can ignore these errors. – Ravindra babu Feb 12 '16 at 10:58
  • Oh, I see your point now. I am aware of the failures that come from speculation, but the failures I am talking about are different: they make my job fail. – mauriciojost Feb 12 '16 at 11:22
