
I've been running a Spark Application and one of the Stages failed with a FetchFailedException. At roughly the same time a log similar to the following appeared in the resource manager logs.

<data> <time>,988 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: User=<user> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=<appid> CONTAINERID=<containerid>

My application was using more resources than YARN had allocated to it; however, it had been running for several days. What I suspect happened is that other applications started up and wanted to use the cluster, and the Resource Manager killed one of my containers to give the resources to them.

Can anyone help me verify my assumption and/or point me to the documentation that describes the log messages that the Resource Manager outputs?

Edit: If it helps, the YARN version I'm running is 2.6.0-cdh5.4.9.

Jacek Laskowski
A Spoty Spot

1 Answer


The INFO message about OPERATION=AM Released Container comes from org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt. My (admittedly vague) understanding of the code is that it is logged after a successful container release, meaning that the container for the ApplicationMaster of your Spark application finished successfully.

I've just answered a similar question, "Why Spark application on YARN fails with FetchFailedException due to Connection refused?" (yours was almost a duplicate).

A FetchFailedException is thrown when a reducer task (for a ShuffleDependency) could not fetch shuffle blocks.
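
For reference, here is a minimal sketch of where that fetch happens (the app name and input path are made up for illustration): reduceByKey introduces a ShuffleDependency, and its reduce-side tasks fetch the shuffle blocks written by the map-side tasks.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))

    // Hypothetical word count: reduceByKey introduces a ShuffleDependency, so
    // its reduce-side tasks fetch shuffle blocks from the executors that wrote
    // the map output -- the fetch that fails when such an executor is lost.
    val counts = sc.textFile("hdfs:///tmp/input")   // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                           // shuffle boundary
    counts.count()                                  // reduce side fetches shuffle blocks here
    sc.stop()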

The root cause of the FetchFailedException is usually that the executor (with the BlockManager hosting the shuffle blocks) was lost, i.e. is no longer available, because:

  1. An OutOfMemoryError (aka OOM) or some other unhandled exception was thrown, or
  2. The cluster manager that manages the workers with the executors of your Spark application, e.g. YARN, enforced its container memory limits and eventually decided to kill the executor due to excessive memory usage.

You should review the logs of the Spark application using the web UI, the Spark History Server, or cluster-specific tools like yarn logs -applicationId for Hadoop YARN (which is your case).

A solution is usually to tune the memory settings of your Spark application; a sketch of the relevant settings follows.
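
As a rough illustration (the values below are placeholders, not recommendations), memory on YARN is typically tuned with spark.executor.memory for the executor JVM heap plus the off-heap overhead that YARN counts against the container limit, which in Spark 1.x on YARN is spark.yarn.executor.memoryOverhead (newer Spark versions call it spark.executor.memoryOverhead):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder values -- size these for your own job and cluster.
    val conf = new SparkConf()
      .setAppName("tuned-app")
      .set("spark.executor.memory", "4g")                 // executor JVM heap
      .set("spark.yarn.executor.memoryOverhead", "1024")  // off-heap headroom in MB (Spark 1.x name)
    val sc = new SparkContext(conf)

YARN sizes the container as roughly the heap plus the overhead, so raising only spark.executor.memory with no headroom can still leave the container over its limit; the same settings can also be passed on the command line via spark-submit --conf.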

Jacek Laskowski