
I have a Spark job that runs fine locally with less data, but when I schedule it to run on YARN I keep getting the following error; one by one all executors are removed from the UI and my job fails:

15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc client disassociated
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 6 on myhost2.com: remote Rpc client disassociated

I use the following command to submit the Spark job in yarn-client mode:

 ./spark-submit --class com.xyz.MySpark --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 3g --master yarn-client --executor-memory 2G --executor-cores 8 --num-executors 12  /home/myuser/myspark-1.0.jar

What is the problem here? I am new to Spark.

Umesh K
  • Try increasing executor memory; one of the most common reasons for executor failures is insufficient memory. When an executor consumes more memory than it was assigned, YARN kills it. The logs you provided give no clue about the reason for the failure; use "yarn logs -applicationId <application ID>" to check the executor logs (see the example commands after these comments). – banjara Jul 30 '15 at 16:41
  • I am seeing this only when we run long-running Spark jobs. If it were a memory issue, it should have failed early on. – Bonnie Varghese Oct 10 '15 at 21:58
  • Have you figured out how to solve this problem? I observe the same one, with no logs confirming that the executor went out of memory. I only see that the driver killed the executor and that the executor got a SIGTERM signal; after this my application goes through an infinite number of stage retries that always fail because a single task fails with FetchFailedException: Executor is not registered. For some reason this type of task failure isn't even retried on a different host; the whole stage is retried. – Dmitriy Sukharev Oct 18 '15 at 20:14
  • Use divide and conquer: make your Spark job do fewer things. In my case I divided my one Spark job into five different jobs. Make sure you shuffle less data (group by, join, etc.). Make sure you don't cache much data; filter first, then cache if needed, using MEMORY_AND_DISK_SER. If you don't cache much, try reducing spark.storage.memoryFraction from 0.6 to something lower. Use Kryo, and try to use Tungsten; Spark 1.5.1 enables it by default. (Some of this is sketched after these comments.) – Umesh K Oct 18 '15 at 21:08
  • @shekhar The YARN NodeManager logs don't always reveal the reason for the kill. – nir Mar 01 '16 at 13:01
  • I am having a similar issue. I turned off the YARN physical and virtual memory limit checks, yet containers are still being killed for no apparent reason. For me the same job logic works on MapReduce on YARN with much less memory. – nir Mar 01 '16 at 13:04
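
The commands below are only a rough sketch of the suggestions in these comments (pulling the executor logs, then resubmitting with Kryo and a smaller storage fraction); the application ID is a placeholder and the 0.4 value is just an example, not something from the original question:

 # fetch the aggregated executor logs for the failed application
 yarn logs -applicationId <application ID>

 # resubmit with Kryo serialization and a smaller storage fraction
 # (the other flags from the question's command are unchanged and omitted here for brevity)
 ./spark-submit --class com.xyz.MySpark --master yarn-client \
   --driver-memory 3g --executor-memory 2G --executor-cores 8 --num-executors 12 \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.storage.memoryFraction=0.4 \
   /home/myuser/myspark-1.0.jar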

3 Answers


I had a very similar problem: executors kept being lost no matter how much memory we allocated to them.

The solution, if you're using YARN, was to set --conf spark.yarn.executor.memoryOverhead=600; alternatively, if your cluster uses Mesos, you can try --conf spark.mesos.executor.memoryOverhead=600 instead.

In Spark 2.3.1+ the configuration option is now --conf spark.executor.memoryOverhead=600.

It seems like we were not leaving sufficient memory for YARN itself and containers were being killed because of it. After setting that we've had different out of memory errors, but not the same lost executor problem.
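
As a concrete sketch, the submit command from the question with this setting added would look roughly like the following (600 is just the value that worked for us; the right value depends on the job):

 ./spark-submit --class com.xyz.MySpark \
   --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
   --driver-java-options -XX:MaxPermSize=512m \
   --driver-memory 3g --master yarn-client \
   --executor-memory 2G --executor-cores 8 --num-executors 12 \
   --conf spark.yarn.executor.memoryOverhead=600 \
   /home/myuser/myspark-1.0.jar
 # on Spark 2.3.1+: --conf spark.executor.memoryOverhead=600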

whaleberg
  • Do you mean `--conf spark.yarn.executor.memoryOverhead=600`, which sets the overhead memory to 600 MB? Is 600 MB enough for your application? I am running a simple application, and I had to set the overhead memory to 3000 to avoid an outOfMemory exception. – panc Jul 08 '16 at 22:07
  • @PanChao Yes, there should be an equals there, thanks for pointing it out. Setting the yarn overhead memory fixed an issue where executors would die without giving any error. It's setting how much memory yarn itself has access to, not how much memory your application can use. If you're seeing an out of memory error you may need to increase either `spark.driver.memory` or `spark.executor.memory` depending on where the error is occurring. Have you tried that? – whaleberg Jul 11 '16 at 18:40
  • Yes, I also increased the values for those two memory parameters. The out of memory exception I encountered is related to the virtual memory limit being exceeded, the issue that is similar to the one reported in this [thread](http://stackoverflow.com/questions/35355823/spark-worker-asking-for-absurd-amounts-of-virtual-memory). It seems that increasing the over head memory could solve this issue, though I am still confused why Spark uses so much virtual memory. Do you know any reference explaining the "pyspark.daemon process" mentioned in the answer of that thread? – panc Jul 12 '16 at 02:26
  • Speaking of the difference between the memory used by YARN and the memory used by the Spark application: since YARN manages the resources, won't the Spark app ask YARN for more memory when it needs it? – panc Jul 12 '16 at 02:31
  • I'm not familiar at all with the pyspark specific options. I've only ever used spark with java/scala. I believe the yarn overhead memory is subtracted from the available memory in the yarn container, so increasing it too high would potentially cause out of memory errors. – whaleberg Jul 12 '16 at 21:00
  • In general the jvm requests large amounts of virtual memory, this isn't usually a problem because you can safely overcommit virtual memory as long as the amount of real memory that's being used is smaller. I think it might be worth opening a new question and including your error message as well as your launch settings. – whaleberg Jul 12 '16 at 21:06
  • This works for `mesos` too. I had the same problem; instead it's `--conf spark.mesos.executor.memoryOverhead=600`. – Luke Aug 04 '16 at 16:58
  • How do I set `spark.yarn.executor.memoryOverhead` in the Scala REPL? (See the note after these comments.) – Alex Raj Kaliamoorthy Aug 05 '16 at 06:41
  • It didn't work for me. I even increased spark.yarn.executor.memoryOverhead to 3000 and I'm still getting the lost-executor error. – Morteza Mashayekhi Dec 18 '17 at 14:47
  • In Spark version 2.3.1 the configuration is now "spark.executor.memoryOverhead" instead of "spark.yarn.executor.memoryOverhead". – Constantino Cronemberger Sep 19 '18 at 14:15
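
Regarding the REPL question above: the overhead determines how large a container Spark requests from YARN, so (as far as I know) it has to be set when the shell is launched rather than changed from inside a running session, for example:

 spark-shell --master yarn-client --conf spark.yarn.executor.memoryOverhead=600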

You can follow this AWS post to calculate the memory overhead (and the other Spark configs to tune): best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr
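
As a rough illustration, assuming Spark's default overhead of max(384 MB, 10% of executor memory), which is roughly the headroom that post recommends, the 2 GB executors from the question work out to:

 # --executor-memory 2G
 # overhead  = max(384 MB, 0.10 * 2048 MB) = 384 MB
 # container = 2048 MB + 384 MB = 2432 MB per executor
 # this total has to fit under yarn.scheduler.maximum-allocation-mb on the cluster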

piritocle

When I had the same issue, deleting old logs to free up more HDFS space worked.
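
For reference, a minimal sketch of checking and freeing HDFS space; the /tmp/logs path is only an assumption, since the real location of aggregated YARN logs depends on yarn.nodemanager.remote-app-log-dir on your cluster:

 hdfs dfs -df -h /                  # check overall HDFS capacity and usage
 hdfs dfs -du -h /tmp/logs          # see how much space aggregated YARN logs take (path is an assumption)
 hdfs dfs -rm -r /tmp/logs/myuser   # delete old application logs that are no longer needed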

Karn_way
  • @whaleberg I have the same issue, but I don't see any memory issue in the logs. What could be the reason for the node going down? Error – BdEngineer Jan 07 '19 at 13:26