Just wondering, how does Spark schedule jobs? In simple terms, please; I have read many descriptions of how it does it, but they were too complicated to understand.


7 Answers

At a high level, when any action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler.

  • The DAG scheduler divides the operators into stages of tasks. A stage is composed of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage (see the sketch after this list). The final result of the DAG scheduler is a set of stages.

  • The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about the dependencies between stages.

  • The workers execute the tasks on the slave nodes.
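
For a concrete picture, here is a minimal sketch in Scala (the local master and input path are illustrative assumptions, not from the answer): the flatMap and map are narrow operators, so they are pipelined into one stage, while reduceByKey introduces a shuffle boundary and starts a second stage.

    import org.apache.spark.{SparkConf, SparkContext}

    object StagesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("stages-sketch"))

        val counts = sc.textFile("hdfs:///tmp/input.txt") // hypothetical input path
          .flatMap(_.split("\\s+"))   // narrow: stage 0
          .map(w => (w, 1))           // narrow: pipelined with flatMap in stage 0
          .reduceByKey(_ + _)         // wide: shuffle boundary => stage 1

        counts.count()                // action: the DAG is built and submitted here
        sc.stop()
      }
    }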

Look at this answer for more information.

– Sathish
  • Assume I am running a Spark client program in client mode and a Spark cluster in standalone mode. Who creates the DAG? Does the Spark client create the DAG, or does the Spark master create it? Does the Spark client program instruct workers on what transformations to run? – user1870400 Dec 01 '16 at 20:14

It depends on what you are calling jobs. If you are talking about independent submits, this is not actually handled by Spark but rather by the host environment (Mesos or Hadoop YARN).

Different jobs within a single SparkContext would, by default, use FIFO scheduling unless you configure it to use the FAIR scheduler.
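
As a minimal sketch (the pool name "poolA" and the local master are illustrative choices of mine, not part of the answer), enabling the FAIR scheduler and routing one thread's jobs to a named pool looks roughly like this:

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable FAIR scheduling for jobs within this SparkContext (the default is FIFO).
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("fair-scheduling-sketch")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to the pool "poolA";
    // threads that don't set this property use the default pool.
    sc.setLocalProperty("spark.scheduler.pool", "poolA")
    sc.parallelize(1 to 1000000).count()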

– Arnon Rotem-Gal-Oz

I think Spark jobs are FIFO (first in, first out).

– KlwntSingh

Spark’s scheduler runs jobs in FIFO fashion.

It is also possible to configure fair sharing between jobs.

To enable the fair scheduler, simply set the spark.scheduler.mode property to FAIR when configuring a SparkContext:

    val conf = new SparkConf().setMaster(...).setAppName(...)
    conf.set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

For more details, please look at https://spark.apache.org/docs/1.2.0/job-scheduling.html


Good question. The terms get used in different ways in different places and can be challenging. The most confusing thing with Spark is that a single run of an application can spawn multiple jobs, each of which is broken into multiple tasks! For example, if an application is multi-threaded, each thread can generate a Spark job. But, in the normal case, an application is one-to-one with a job. One run of an application generates one job.
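
To make the multi-threaded case concrete, here is a rough sketch (the local master and the trivial RDDs are my own illustrative assumptions): one application, one SparkContext, but two threads each calling an action, so two Spark jobs are submitted concurrently.

    import org.apache.spark.{SparkConf, SparkContext}

    object MultiJobSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("multi-job-sketch"))

        // Each action is one Spark job; run from separate threads, the two jobs
        // can be scheduled concurrently within the same application.
        val t1 = new Thread(new Runnable {
          def run(): Unit = sc.parallelize(1 to 1000000).sum()
        })
        val t2 = new Thread(new Runnable {
          def run(): Unit = sc.parallelize(1 to 1000000).count()
        })
        t1.start(); t2.start()
        t1.join(); t2.join()
        sc.stop()
      }
    }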

Now, Spark was made in a flexible way, so it decouples the scheduling piece and makes it pluggable. Many different schedulers can be plugged in. The three most popular are YARN (from Hadoop), Mesos, and Spark's own built-in scheduler. So there is a variety of scheduling behaviors.

The next confusing thing is that both jobs and tasks are scheduled. A job is assigned resources. This can be done statically, so that, say, a set of servers is assigned to a job, and then that job is the only one that can use those servers. Or, resources can be shared amongst jobs. Once resources are assigned, the job is pointed to a task scheduler. The job then generates tasks and gives them to the task scheduler, which assigns the tasks to particular resources. The same entity that assigns resources to jobs also supplies the task scheduler (i.e., YARN, Mesos, or Spark's built-in scheduler). So there is also variability in how the task scheduler works.
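
As a hedged illustration of the "assigning resources to a job" part, these are standard Spark configuration keys an application can set when asking the cluster manager for executors; the particular numbers are purely illustrative:

    import org.apache.spark.SparkConf

    // Illustrative resource request: on YARN or Mesos these settings ask the
    // cluster manager for executors, which then host this application's tasks.
    val conf = new SparkConf()
      .setAppName("resource-request-sketch")
      .set("spark.executor.instances", "4")   // number of executors to request
      .set("spark.executor.cores", "2")       // cores (task slots) per executor
      .set("spark.executor.memory", "2g")     // memory per executor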

In general the schedulers try to track the location of data and then assign tasks to locations where the data already resides or where there is plenty of available network capacity for moving the data.
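
That locality preference is tunable; for example, the spark.locality.wait setting controls how long the task scheduler waits for a data-local slot before falling back to a less local one (the value shown is the documented default):

    import org.apache.spark.SparkConf

    // Wait up to 3 seconds for a data-local executor before launching the task
    // at a less local level (node-local, rack-local, any).
    val localityConf = new SparkConf().set("spark.locality.wait", "3s")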

The complicating factor is that tasks have dependencies on each other. Enforcing those dependencies is actually part of the scheduling process, but Spark terminology gets confusing on that point: in Spark, only the final assignment of tasks to processors is termed "scheduling".

– seanhalle

I'll give you an example.

  • Suppose you have an application which does the following:

    1. Read data from HDFS
    2. Filter op on column_1
    3. Filter op on column_2
    4. Filter op on column_3
    5. Write the RDD to HDFS
  • Spark's DAGScheduler analyzes the course of action in the application and designs the best approach to achieve the task.
  • By which I mean, instead of having separate stages for each of the filter operations, it will treat all three filters as one stage. So instead of going through the dataset three times for the filters, it will scan it just once, which is surely a more optimized way of doing it (see the sketch after this list).
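
A rough sketch of that pipeline in Scala (the paths, comma delimiter, and the particular column predicates are illustrative assumptions): because all three filters are narrow transformations, the DAG scheduler pipelines them into a single stage, and the data is read only once when the write action runs.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("filter-pipeline-sketch"))

    val filtered = sc.textFile("hdfs:///data/input")   // 1. read data from HDFS
      .map(_.split(","))
      .filter(r => r(0).nonEmpty)                      // 2. filter on column_1
      .filter(r => r(1) != "NULL")                     // 3. filter on column_2
      .filter(r => r(2).toDouble > 0.0)                // 4. filter on column_3

    filtered.map(_.mkString(","))
      .saveAsTextFile("hdfs:///data/output")           // 5. write back to HDFS
    sc.stop()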

Hope this helps.

– avrsanjay

Databricks session at Spark Summit on the scheduler:

Deep Dive into the Apache Spark Scheduler (Xingbo Jiang) - YouTube

– Shiva Garg