Questions tagged [hadoop-yarn]

YARN (Yet Another Resource Negotiator) is a key component of the second-generation Apache Hadoop infrastructure. DO NOT USE THIS for the JavaScript/Node.js Yarn package manager (use [yarnpkg] instead)! Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications, including next-generation MapReduce (MRv2).

In the Big Data business, running fewer, larger clusters is cheaper than running many small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Background

The current implementation of the Hadoop MapReduce framework is showing its age.

Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last five years there have been spot fixes, but lately these have come at an ever-growing cost, as evidenced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood - even as far back as late 2007, when we documented the proposed fix on MapReduce's JIRA: MAPREDUCE-278.

From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any minor or major change, such as bug fixes, performance improvements and new features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive customer cycles as they validate the new version of Hadoop for their applications.

The Next Generation of MapReduce

[Figure: YARN Architecture]

The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classic MapReduce sense or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
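
To make the division of labor concrete, here is a minimal sketch of the client side of that fabric: a program that asks the ResourceManager to start an application, whose ApplicationMaster then takes over the per-application scheduling. It assumes the standard org.apache.hadoop.yarn.client.api.YarnClient API; the application name, queue, resource sizes and the com.example.MyApplicationMaster command are placeholders, not part of YARN itself.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    // The client only talks to the ResourceManager; the ApplicationMaster
    // launched in the container below does the per-application scheduling.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");                 // placeholder name

    // Container that will host the ApplicationMaster; the command is hypothetical.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java com.example.MyApplicationMaster"
            + " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"));
    ctx.setAMContainerSpec(amContainer);

    // Resources requested for the ApplicationMaster container itself.
    ctx.setResource(Resource.newInstance(1024 /* MB */, 1 /* vCores */));
    ctx.setQueue("default");

    ApplicationId appId = yarnClient.submitApplication(ctx);
    System.out.println("Submitted application " + appId);
  }
}
```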

The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. It also offers no guarantees about restarting failed tasks, whether the failure is due to the application or to hardware.

The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource-request types that represent the resources required for its containers. The resource requests include memory, CPU, disk, network etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. Scheduler plug-ins can be based, for example, on the current CapacityScheduler and FairScheduler.
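
As a rough illustration of that resource-request model, the sketch below builds a single request for memory and vCores at a given priority using the AMRMClient.ContainerRequest class that ApplicationMasters typically use; the sizes and priority value are arbitrary placeholders.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class ResourceRequestExample {
  public static AMRMClient.ContainerRequest buildRequest() {
    // A request names how much memory and CPU the container needs,
    // instead of occupying a fixed-size map or reduce slot.
    Resource capability = Resource.newInstance(2048 /* MB */, 2 /* vCores */);
    // Lower number = higher priority within this application.
    Priority priority = Priority.newInstance(0);
    // Nodes and racks left null mean "anywhere in the cluster".
    return new AMRMClient.ContainerRequest(capability, null, null, priority);
  }
}
```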

The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring their progress, and handling task failures.
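
Putting those responsibilities together, the following is a simplified, assumption-laden skeleton of an ApplicationMaster: it registers with the ResourceManager, requests one container, launches a placeholder command on the allocated NodeManager, and unregisters. A real ApplicationMaster would also track task status, re-request containers for failed tasks and report progress; the resource sizes and the "sleep 30" command here are only illustrative.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class MyApplicationMaster {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // Register with the ResourceManager's scheduler interface.
    AMRMClient<AMRMClient.ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    // Client used to ask NodeManagers to actually launch containers.
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    // Ask for one container; memory/vCores are placeholder values.
    Resource capability = Resource.newInstance(1024, 1);
    rmClient.addContainerRequest(
        new AMRMClient.ContainerRequest(capability, null, null, Priority.newInstance(0)));

    int launched = 0;
    while (launched < 1) {
      // Heartbeat to the ResourceManager; allocated containers come back here.
      AllocateResponse response = rmClient.allocate(0.1f);
      for (Container container : response.getAllocatedContainers()) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("sleep 30")); // placeholder task
        nmClient.startContainer(container, ctx);
        launched++;
      }
      Thread.sleep(1000);
    }

    // Tell the ResourceManager we are done so it can clean up.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```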

MRv2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.
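
For example, a hadoop-1.x-style job driver such as the classic word count below is written against the same org.apache.hadoop.mapreduce API; once recompiled against the MRv2 client libraries, it runs on YARN with a MapReduce ApplicationMaster negotiating its containers. This is a sketch of the standard textbook example, not a verbatim copy of the Apache sample.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // Same job-submission API as hadoop-1.x; on an MRv2 cluster the job is
    // executed in YARN containers managed by a MapReduce ApplicationMaster.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```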

3897 questions
251 votes · 9 answers
Apache Spark: The number of cores vs. the number of executors
I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN. The test environment is as follows: Number of data nodes: 3 Data node machine spec: CPU: Core i7-4790 (# of cores: 4, #…
— zeodtr

136 votes · 5 answers
How to kill a running Spark application?
I have a running Spark application where it occupies all the cores where my other applications won't be allocated any resource. I did some quick research and people suggested using YARN kill or /bin/spark-class to kill the command. However, I am…
— B.Mr.W.

95 votes · 9 answers
Container is running beyond memory limits
In Hadoop v1, I have assigned each 7 mapper and reducer slot with size of 1GB, my mappers & reducers runs fine. My machine has 8G memory, 8 processor. Now with YARN, when run the same application on the same machine, I got container error. By…
— Lishu

81 votes · 4 answers
Which cluster type should I choose for Spark?
I am new to Apache Spark, and I just learned that Spark supports three types of cluster: Standalone - meaning Spark will manage its own cluster YARN - using Hadoop's YARN resource manager Mesos - Apache's dedicated resource manager project I think…
— David S.

79 votes · 2 answers
Hadoop truncated/inconsistent counter name
For now, I have a Hadoop job which creates counters with a pretty big name. For example, the following one:…
— mr.nothing

56 votes · 4 answers
Spark yarn cluster vs client - how to choose which one to use?
The spark docs have the following paragraph that describes the difference between yarn client and yarn cluster: There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an…
— Chris Snow

55 votes · 3 answers
How to prevent Spark Executors from getting Lost when using YARN client mode?
I have one Spark job which runs fine locally with less data but when I schedule it on YARN to execute I keep on getting the following error and slowly all executors get removed from UI and my job fails 15/07/30 10:18:13 ERROR cluster.YarnScheduler:…
— Umesh K

55 votes · 4 answers
Where are logs in Spark on YARN?
I'm new to spark. Now I can run spark 0.9.1 on yarn (2.0.0-cdh4.2.1). But there is no log after execution. The following command is used to run a spark example. But logs are not found in the history server as in a normal MapReduce…
— DeepNightTwo

55 votes · 6 answers
What is yarn-client mode in Spark?
Apache Spark has recently updated the version to 0.8.1, in which yarn-client mode is available. My question is, what does yarn-client mode really mean? In the documentation it says: With yarn-client mode, the application will be launched locally.…
— zxz

52 votes · 5 answers
FetchFailedException or MetadataFetchFailedException when processing big data set
When I run the parsing code with a 1 GB dataset it completes without any error. But when I attempt 25 GB of data at a time I get the errors below. I'm trying to understand how I can avoid these failures. Happy to hear any suggestions or ideas. Different…
— WoodChopper

52 votes · 2 answers
What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?
I would like to know the relation between the mapreduce.map.memory.mb and mapred.map.child.java.opts parameters. Is mapreduce.map.memory.mb > mapred.map.child.java.opts?
— yedapoda

51 votes · 13 answers
Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)
I am running kinesis plus spark application https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html I am running as below command on ec2 instance : ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname…
— Sam

50 votes · 1 answer
Why does a JVM report more committed memory than the linux process resident set size?
When running a Java app (in YARN) with native memory tracking enabled (-XX:NativeMemoryTracking=detail see https://docs.oracle.com/javase/8/docs/technotes/guides/vm/nmt-8.html and…
— Dave L.

48 votes · 9 answers
What is a container in YARN?
What is a container in YARN? Is it same as the child JVM in which the tasks on the nodemanager run or is it different?
— rahul

46 votes · 4 answers
How to set amount of Spark executors?
How could I configure from Java (or Scala) code amount of executors having SparkConfig and SparkContext? I see constantly 2 executors. Looks like spark.default.parallelism does not work and is about something different. I just need to set amount of…
— Roman Nikitchenko