
I am aware of the basics of the YARN framework, but I still feel I'm missing some understanding with regard to MapReduce.

With YARN, I have read that MapReduce is just one of the applications that can run on top of it; for example, on the same cluster, various kinds of jobs can run: MapReduce jobs, Spark jobs, etc.

Now, the point is that each type of job has its own kind of "job phases"; for example, MapReduce has phases such as map, sort, shuffle, and reduce.

Specific to this scenario, who "decides" and "controls" these phases? Is it the MapReduce framework?

As I understand it, YARN is an infrastructure on which different jobs run. So when we submit a MapReduce job, does it first go to the MapReduce framework, and is the code then executed by YARN? I have this doubt because YARN is a general-purpose execution engine, so it has no knowledge of mappers, reducers, etc., which are specific to MapReduce (and likewise for other kinds of jobs). So does the MapReduce framework run on top of YARN, with YARN helping to execute the jobs, while the MapReduce framework is aware of the phases it has to go through for a particular kind of job?

Any clarification to understand this would be of great help.

CuriousMind
  • Go through this blog series: https://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/ – sujit Mar 30 '18 at 07:17

2 Answers


When we submit a MapReduce job, it first goes to the Resource Manager, which is the master daemon of YARN. The Resource Manager then selects a Node Manager (the Node Managers are YARN's slave processes) and asks it to start a container in which a very lightweight process known as the Application Master is launched. The Resource Manager then asks the Application Master to start executing the job.

The Application Master first goes through the driver part of the job, from which it learns what resources the job will need, and accordingly requests those resources from the Resource Manager. The Resource Manager can assign the resources to the Application Master immediately, or, if the cluster is too occupied, the request is rescheduled based on the configured scheduling algorithm.

After getting the resources, the Application Master goes to the Name Node to get the metadata of all the blocks that need to be processed for this job. Using that metadata, it asks the Node Managers of the nodes where the blocks are stored (if those nodes are too busy, then a node in the same rack, otherwise any other node, depending on rack awareness) to launch containers for processing their respective blocks. The blocks are processed independently and in parallel on their respective nodes. After the entire processing is done, the result is stored in HDFS.
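
For context, here is roughly what the "driver part of the job" mentioned above looks like in practice. This is essentially the classic WordCount example from the Hadoop MapReduce tutorial: the MapReduce code describes the mapper, reducer, and job configuration, but says nothing about Resource Managers or containers. With mapreduce.framework.name set to yarn (usually in mapred-site.xml), waitForCompletion() submits the job to the Resource Manager, and YARN launches the MapReduce application master, which then drives the phases.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split this container processes.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word after shuffle/sort.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // The "driver": it only *describes* the job; submission and phase
        // orchestration are handled by the framework's application master on YARN.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }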

Rajnil Guha

If you take a look at this picture from the Hadoop documentation:

Yarn Architecture

You'll see that there's no particular "job orchestration" component, but rather a resource-requesting component called the Application Master. As you mentioned, YARN does resource management; when it comes to application orchestration, it stops at an abstract level:

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
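
To make "framework specific library" concrete, below is a heavily simplified sketch of the resource-negotiation skeleton an application master is built around, using the AMRMClient/NMClient API from Hadoop's "Writing YARN Applications" guide (linked in the comments below). The resource sizes, the loop, and what would actually run in the containers are illustrative assumptions; a real application master (Spark's, or MapReduce's MRAppMaster) wraps its framework-specific scheduling logic around this kind of skeleton.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ToyApplicationMaster {
        public static void main(String[] args) throws Exception {
            Configuration conf = new YarnConfiguration();

            // Register this application master with the Resource Manager.
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(conf);
            rmClient.start();
            rmClient.registerApplicationMaster("", 0, "");

            // Client used to talk to Node Managers when launching containers.
            NMClient nmClient = NMClient.createNMClient();
            nmClient.init(conf);
            nmClient.start();

            // Ask YARN for containers. YARN only sees memory/vcore requests; only
            // the application master knows *why* it wants them (map tasks, reduce
            // tasks, Spark executors, ...). Sizes and counts here are illustrative.
            Priority priority = Priority.newInstance(0);
            Resource capability = Resource.newInstance(1024, 1); // 1 GB, 1 vcore
            for (int i = 0; i < 2; i++) {
                rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
            }

            // A real AM calls allocate() in a heartbeat loop and launches each
            // allocated container with a framework-specific command, e.g.
            // nmClient.startContainer(container, containerLaunchContext).
            for (Container container : rmClient.allocate(0.1f).getAllocatedContainers()) {
                System.out.println("Allocated " + container.getId() + " on " + container.getNodeId());
            }

            // ... framework-specific work (phases, task tracking, retries) happens here ...

            rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        }
    }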

When applied to Spark, some of the components in that picture would be:

  • Client: the spark-submit process
  • App Master: Spark's application master, which runs both the driver and the application master (cluster mode) or just the application master (client mode)
  • Container: Spark executors

Spark's YARN infrastructure provides the application master (in YARN terms), which knows about Spark's architecture. So when the driver runs, either in cluster mode or in client mode, it still decides on jobs/stages/tasks. This must be application/framework-specific (Spark being the "framework" when it comes to YARN).

From the Spark documentation on YARN deployment:

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
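
As a small illustration of the client-mode case (a sketch only; it assumes a working Hadoop/YARN client configuration and that the jar is launched with spark-submit --master yarn --deploy-mode client): the application code never mentions containers or Node Managers. The driver in this JVM decides on jobs/stages/tasks, while Spark's YARN-specific application master only negotiates executor containers with the Resource Manager.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class YarnClientModeSketch {
        public static void main(String[] args) {
            // The master ("yarn") and deploy mode are normally supplied by spark-submit, e.g.:
            //   spark-submit --master yarn --deploy-mode client --class YarnClientModeSketch app.jar
            SparkConf conf = new SparkConf().setAppName("yarn-client-mode-sketch");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // The driver (this process, in client mode) plans jobs/stages/tasks;
            // YARN only provides the containers that the executors run in.
            long count = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
            System.out.println("count = " + count);

            sc.stop();
        }
    }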

You can extend this abstraction to map-reduce, given your understanding of that framework.

ernest_k
  • Thanks for your answer. So, if we extend it to MapReduce, are the framework specifics controlled by "MRAppMaster"? From YARN's perspective it knows nothing about any of the phases; it just allocates resources, and the actual framework-specific things are handled by the respective containers? Is this interpretation/understanding correct? – CuriousMind Mar 30 '18 at 05:59
  • 1
  • Not sufficiently educated in MR to confirm that, but that sounds right. I see the MR code supplies an application master, which would make it analogous to what's described above, meaning that your MR application deals with the tasks while the application master interface allows negotiating resources with YARN. There's a much more in-depth explanation on this page: https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html – ernest_k Mar 30 '18 at 06:18