23

On a 3-node Spark/Hadoop cluster, which scheduler (cluster manager) will work most efficiently? Currently I am using the Standalone manager, but for each Spark job I have to explicitly specify all resource parameters (e.g. cores, memory), which I want to avoid. I have also tried YARN, but it runs 10x slower than the Standalone manager.

Would Mesos be helpful?

Cluster Details: Spark 1.2.1 and Hadoop 2.7.1

Abhinandan Satpute
  • [Disclaimer: Not a Yarn expert] I think it strongly depends on what future workload you plan to add to your cluster. Mesos is a generic scheduler, while Yarn is more tailored for Hadoop workloads. – rukletsov Aug 04 '15 at 13:04
  • Have a look at this related SE question: http://stackoverflow.com/questions/28664834/which-cluster-type-should-i-choose-for-spark/34657719#34657719 – Ravindra babu Sep 06 '16 at 10:19

3 Answers

36

Apache Spark runs in the following cluster modes:

  • Local
  • Standalone
  • YARN
  • Mesos
  • Kubernetes
  • Nomad
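As a rough illustration, each officially supported mode is selected through the master URL passed to `spark-submit --master`. The sketch below maps a mode name to that URL format. The function name `master_url` and the hosts and ports in the usage lines are hypothetical placeholders; Nomad is left out because it is not an officially supported cluster manager.

```python
def master_url(mode, host=None, port=None, threads="*"):
    """Return the Spark master URL string for a cluster mode.

    Hypothetical helper for illustration only; host and port are placeholders.
    """
    if mode == "local":
        # local[*] uses as many worker threads as there are logical cores
        return f"local[{threads}]"
    if mode == "yarn":
        # Plain "yarn" is the Spark 2.x+ form; the cluster location comes from
        # the Hadoop configuration. Spark 1.x releases used
        # "yarn-client" / "yarn-cluster" instead.
        return "yarn"
    if host is None or port is None:
        raise ValueError(f"mode {mode!r} needs a host and a port")
    if mode == "standalone":
        return f"spark://{host}:{port}"
    if mode == "mesos":
        return f"mesos://{host}:{port}"
    if mode == "kubernetes":
        # Points at the Kubernetes API server
        return f"k8s://https://{host}:{port}"
    raise ValueError(f"unknown mode: {mode!r}")


# Example usage with placeholder hosts and ports
print(master_url("local"))
print(master_url("standalone", "master-host", 7077))
print(master_url("mesos", "mesos-host", 5050))
print(master_url("kubernetes", "k8s-apiserver", 6443))
```

The resulting string is what goes after `--master` on the `spark-submit` command line, or into the `spark.master` property.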

Local mode runs a Spark application on a single machine, directly on the operating system and without a cluster manager. This mode is useful for Spark application development and testing.

Standalone, YARN, Mesos and Kubernetes are distributed modes. In a distributed environment, managing computing resources is very important, so to do it efficiently we need a good resource management system, or resource scheduler.

Standalone is good for small Spark clusters, but it is not good for bigger clusters, because there is an overhead of running Spark daemons (master + slave) on the cluster nodes. These daemons require dedicated resources, so standalone is not recommended for bigger production clusters. Standalone supports only Spark applications and is not a general-purpose cluster manager. In an enterprise context, where we have a variety of workloads to run, the Spark standalone cluster manager is not a good choice.

In YARN and Mesos modes, Spark runs as an application and there is no Spark daemon overhead. So we can use either YARN or Mesos for better performance and scalability. Both YARN and Mesos are general-purpose distributed resource managers, and they support a variety of workloads such as MapReduce, Spark, Flink and Storm, with container orchestration. They are good for running large-scale enterprise production clusters.

Between YARN and Mesos, YARN is designed specifically for Hadoop workloads, whereas Mesos is designed for all kinds of workloads. YARN is an application-level scheduler and Mesos is an OS-level scheduler. It is better to use YARN if you already have a running Hadoop cluster (Apache/CDH/HDP). For a brand new project, it is better to use Mesos (Apache, Mesosphere). It is also possible to use both of them in a colocated manner through a project called Apache Myriad.

Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications, so it is used for running Spark applications in containers. Most cloud vendors, such as Google, Microsoft and Amazon, offer Kubernetes as a managed service in the cloud. We can also run an on-premises K8s cluster to run Spark applications in containers. Here the containers are Docker or cgroups/Linux containers.

Nomad is another open source system for running Spark applications. It is not officially supported by the Spark project as a cluster manager.

Of all the modes above, Apache Mesos has the better resource management capabilities.

Please see this link, which contains a detailed explanation from experts about YARN vs Mesos: http://www.quora.com/How-does-YARN-compare-to-Mesos

Naga
8

On a 3-node cluster I'd just go with the standalone manager; the overhead of the additional processes would not pay off.
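One possible way to avoid repeating resource flags on every job under the standalone manager is to put defaults in `conf/spark-defaults.conf`, which `spark-submit` reads. The sketch below only renders such a file as text. The helper name `render_spark_defaults`, the host `master-host` and the memory and core values are assumptions for illustration, not tuned recommendations.

```python
def render_spark_defaults(settings):
    """Render Spark properties as spark-defaults.conf text (one 'key value' per line)."""
    lines = [f"{key} {value}" for key, value in sorted(settings.items())]
    return "\n".join(lines) + "\n"


# Assumed example values; adjust to the actual cluster
defaults = {
    "spark.master": "spark://master-host:7077",  # placeholder standalone master
    "spark.executor.memory": "4g",               # memory per executor
    "spark.cores.max": "8",                      # total cores per application (standalone)
}

conf_text = render_spark_defaults(defaults)
print(conf_text)
```

Values given explicitly on the `spark-submit` command line still take precedence over the file, so individual jobs can override these defaults when needed.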

Arnon Rotem-Gal-Oz
1

As of Spark 3.x, there are several cluster manager types:

  • Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
  • Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN – the resource manager in Hadoop 2.
  • Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.

More at https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types