Just wondering, how does Spark schedule jobs? In simple terms, please; I have read many descriptions of how it does it, but they were too complicated to understand.


7 Answers

At a high level, when any action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler.

  • The DAG scheduler divides the operators into stages of tasks. A stage is composed of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage (see the sketch after this list). The final result of the DAG scheduler is a set of stages.

  • The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about the dependencies between stages.

  • The workers execute the tasks on the slave nodes.
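
For a concrete picture, here is a minimal sketch in Scala (the local master and input path are illustrative assumptions, not from the answer): the flatMap and map are narrow operators, so they are pipelined into one stage, while reduceByKey introduces a shuffle boundary and starts a second stage.

    import org.apache.spark.{SparkConf, SparkContext}

    object StagesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("stages-sketch"))

        val counts = sc.textFile("hdfs:///tmp/input.txt") // hypothetical input path
          .flatMap(_.split("\\s+"))   // narrow: stage 0
          .map(w => (w, 1))           // narrow: pipelined with flatMap in stage 0
          .reduceByKey(_ + _)         // wide: shuffle boundary => stage 1

        counts.count()                // action: the DAG is built and submitted here
        sc.stop()
      }
    }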

Look at this answer for more information.

– Sathish
  • Assume I am running a Spark client program in client mode and a Spark cluster in standalone mode. Who creates the DAG? Does the Spark client create the DAG, or does the Spark master create it? Does the Spark client program instruct workers on what transformations to run? – user1870400 Dec 01 '16 at 20:14

It depends on what you are calling jobs. If you are talking about independent submits, this is not actually handled by Spark but rather by the host environment (Mesos or Hadoop YARN).

Different jobs within a single SparkContext would, by default, use FIFO scheduling unless you configure it to use the FAIR scheduler.
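
As a minimal sketch (the pool name "poolA" and the local master are illustrative choices of mine, not part of the answer), enabling the FAIR scheduler and routing one thread's jobs to a named pool looks roughly like this:

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable FAIR scheduling for jobs within this SparkContext (the default is FIFO).
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("fair-scheduling-sketch")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to the pool "poolA";
    // threads that don't set this property use the default pool.
    sc.setLocalProperty("spark.scheduler.pool", "poolA")
    sc.parallelize(1 to 1000000).count()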

– Arnon Rotem-Gal-Oz

I think Spark jobs are FIFO (first in, first out).

– KlwntSingh

Spark’s scheduler runs jobs in FIFO fashion.

It is also possible to configure fair sharing between jobs.

To enable the fair scheduler, simply set the spark.scheduler.mode property to FAIR when configuring a SparkContext:

    val conf = new SparkConf().setMaster(...).setAppName(...)
    conf.set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

For more details, please look at https://spark.apache.org/docs/1.2.0/job-scheduling.html


Good question. The terms get used in different ways in different places and can be challenging. The most confusing thing with Spark is that a single run of an application can spawn multiple jobs, each of which is broken into multiple tasks! For example, if an application is multi-threaded, each thread can generate a Spark job. But, in the normal case, an application is one-to-one with a job. One run of an application generates one job.
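
To make the multi-threaded case concrete, here is a rough sketch (the local master and the trivial RDDs are my own illustrative assumptions): one application, one SparkContext, but two threads each calling an action, so two Spark jobs are submitted concurrently.

    import org.apache.spark.{SparkConf, SparkContext}

    object MultiJobSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("multi-job-sketch"))

        // Each action is one Spark job; run from separate threads, the two jobs
        // can be scheduled concurrently within the same application.
        val t1 = new Thread(new Runnable {
          def run(): Unit = sc.parallelize(1 to 1000000).sum()
        })
        val t2 = new Thread(new Runnable {
          def run(): Unit = sc.parallelize(1 to 1000000).count()
        })
        t1.start(); t2.start()
        t1.join(); t2.join()
        sc.stop()
      }
    }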

Now, Spark was made in a flexible way, so it decouples the scheduling piece and makes it pluggable. Many different schedulers can be plugged in. The three most popular are YARN (from Hadoop), Mesos, and Spark's own built-in scheduler. So there is a variety of scheduling behaviors.

The next confusing thing is that both jobs and tasks are scheduled. A job is assigned resources. This can be done statically, so that, say, a set of servers is assigned to a job, and then that job is the only one that can use those servers. Or, resources can be shared amongst jobs. Once resources are assigned, the job is pointed to a task scheduler. The job then generates tasks and gives them to the task scheduler, which assigns the tasks to particular resources. The same entity that assigns resources to jobs also supplies the task scheduler (i.e., YARN, Mesos, or Spark's built-in scheduler). So there is also variability in how the task scheduler works.
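
As a hedged illustration of the "assigning resources to a job" part, these are standard Spark configuration keys an application can set when asking the cluster manager for executors; the particular numbers are purely illustrative:

    import org.apache.spark.SparkConf

    // Illustrative resource request: on YARN or Mesos these settings ask the
    // cluster manager for executors, which then host this application's tasks.
    val conf = new SparkConf()
      .setAppName("resource-request-sketch")
      .set("spark.executor.instances", "4")   // number of executors to request
      .set("spark.executor.cores", "2")       // cores (task slots) per executor
      .set("spark.executor.memory", "2g")     // memory per executor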

In general the schedulers try to track the location of data and then assign tasks to locations where the data already resides or where there is plenty of available network capacity for moving the data.
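
That locality preference is tunable; for example, the spark.locality.wait setting controls how long the task scheduler waits for a data-local slot before falling back to a less local one (the value shown is the documented default):

    import org.apache.spark.SparkConf

    // Wait up to 3 seconds for a data-local executor before launching the task
    // at a less local level (node-local, rack-local, any).
    val localityConf = new SparkConf().set("spark.locality.wait", "3s")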

The complicating factor is that tasks have dependencies on each other. Enforcing those dependencies is actually part of the scheduling process, but Spark terminology gets confusing on that point: in Spark, only the final assignment of tasks to processors is termed "scheduling".

– seanhalle

I'll give you an example.

  • Suppose you have an application which does the following:

    1. Read data from HDFS
    2. Filter op on column_1
    3. Filter op on column_2
    4. Filter op on column_3
    5. Write the RDD to HDFS
  • Spark's DAGScheduler analyzes the course of action in the application and designs the best approach to achieve the task.
  • By which I mean, instead of having separate stages for each of the filter operations, it will treat all three filters as one stage. So instead of going through the dataset three times for the filters, it will scan it just once, which is surely a more optimized way of doing it (see the sketch after this list).
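
A rough sketch of that pipeline in Scala (the paths, comma delimiter, and the particular column predicates are illustrative assumptions): because all three filters are narrow transformations, the DAG scheduler pipelines them into a single stage, and the data is read only once when the write action runs.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("filter-pipeline-sketch"))

    val filtered = sc.textFile("hdfs:///data/input")   // 1. read data from HDFS
      .map(_.split(","))
      .filter(r => r(0).nonEmpty)                      // 2. filter on column_1
      .filter(r => r(1) != "NULL")                     // 3. filter on column_2
      .filter(r => r(2).toDouble > 0.0)                // 4. filter on column_3

    filtered.map(_.mkString(","))
      .saveAsTextFile("hdfs:///data/output")           // 5. write back to HDFS
    sc.stop()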

Hope this helps.

– avrsanjay

Databricks session at Spark Summit on the scheduler:

Deep Dive into the Apache Spark Scheduler (Xingbo Jiang) - YouTube

– Shiva Garg