
I have a Spark (1.4.1) application, running on YARN, that fails with the following executor log entry:

16/07/21 23:09:08 ERROR executor.CoarseGrainedExecutorBackend: Driver 9.4.136.20:55995 disassociated! Shutting down.
16/07/21 23:09:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /dfs1/hadoop/yarn/local/usercache/mitchus/appcache/application_1465987751317_1172/blockmgr-f367f43b-f4c8-4faf-a829-530da30fb040/1c/temp_shuffle_581adb36-1561-4db8-a556-c4ac0e6400ed
java.io.FileNotFoundException: /dfs1/hadoop/yarn/local/usercache/mitchus/appcache/application_1465987751317_1172/blockmgr-f367f43b-f4c8-4faf-a829-530da30fb040/1c/temp_shuffle_581adb36-1561-4db8-a556-c4ac0e6400ed (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(BlockObjectWriter.scala:189)
    at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:328)
    at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:257)
    at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:95)
    at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:83)
    at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:95)
    at org.apache.spark.util.collection.ExternalSorter.maybeSpillCollection(ExternalSorter.scala:240)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Any clues as to what might have gone wrong?

mitchus
  • Mind upgrading to 1.6.2 (or soon 2.0)? There were some issues reported similar to your case and fixed in the recent releases. – Jacek Laskowski Jul 23 '16 at 19:23
  • @JacekLaskowski I would like that, but it's not up to me. – mitchus Jul 25 '16 at 08:02
  • I got a similar message earlier today with Spark 2.0 under SparkR; restarting my session seemed to clear the error - probably won't help OP, but just sayin'. – russellpierce Aug 14 '16 at 21:48
  • Restarting Spark worked for me too, @rpierce. Thanks. – desaiankitb May 17 '17 at 06:02
  • Did you by any chance set the master to 'local' in your Spark context and then use spark-submit in yarn mode? – seagull1089 Jun 05 '17 at 22:33
  • @seagull1089, can you elaborate on where I can specify my Spark context as non-`local`? I am creating my SparkContext object as follows: `sc = SparkContext(appName = "Tracks")` – Ravi Chandra Jun 06 '17 at 07:22
  • @RaviChandra: something like this: `val conf = new SparkConf().setAppName("Application Name"); conf.setMaster("local[*]"); val sc = new SparkContext(conf)` – ceteras Jun 22 '17 at 06:57
  • Can it be related to https://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files ? – ucsky Mar 13 '18 at 19:30

2 Answers


The error is caused by the temp shuffle file being deleted while it is still in use. That can happen for several reasons; the one I ran into was another executor being killed by YARN. Once an executor is killed, a shut-down signal is sent to the remaining executors, and the ShutdownHookManager then deletes all of the temp files that were registered with it. That is why you see this error. So you may need to check the logs for ShutdownHookManager entries.
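
For illustration only (this is not Spark's source code, and all names below are made up), here is a minimal Scala sketch of the behaviour described above: a JVM shutdown hook deletes a registered temp directory, so a writer that still holds a path inside it will subsequently fail with java.io.FileNotFoundException, just like the DiskBlockObjectWriter in the stack trace.

    import java.io.{File, FileOutputStream}
    import java.nio.file.Files

    object ShutdownHookIllustration extends App {
      // A scratch directory standing in for an executor's block-manager directory.
      val tempDir: File = Files.createTempDirectory("blockmgr-demo").toFile

      // Register cleanup on JVM shutdown, as Spark's ShutdownHookManager does
      // for the directories it manages.
      sys.addShutdownHook {
        Option(tempDir.listFiles()).foreach(_.foreach(_.delete()))
        tempDir.delete()
      }

      // A task that is still writing after the hook has run would try to reopen
      // a path that no longer exists and get "No such file or directory".
      val shuffleFile = new File(tempDir, "temp_shuffle_demo")
      val out = new FileOutputStream(shuffleFile)
      out.write(1)
      out.close()
    }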

lxy

You can try to increase spark.yarn.executor.memoryOverhead. If YARN is killing executors for exceeding their container memory limits, raising the overhead can prevent those kills (and the temp-file cleanup they trigger).
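
For example (a sketch with placeholder values, not a tuned recommendation), the setting can be placed on the SparkConf before the context is created:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.yarn.executor.memoryOverhead is the extra off-heap memory (in MB)
    // that YARN adds to each executor container on top of spark.executor.memory.
    // The values below are placeholders; adjust them to your job and cluster.
    val conf = new SparkConf()
      .setAppName("Tracks")
      .set("spark.executor.memory", "4g")
      .set("spark.yarn.executor.memoryOverhead", "1024")

    val sc = new SparkContext(conf)

Equivalently it can be passed on the command line, e.g. spark-submit --conf spark.yarn.executor.memoryOverhead=1024 (in Spark 2.3 and later the property is named spark.executor.memoryOverhead).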

Prasad Khode
huron