
I'd like to understand the internals of Spark's FAIR scheduling mode. The thing is that it does not seem as fair as one would expect from the official Spark documentation:

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

It seems that jobs are not handled equally and are actually managed in FIFO order.

To give more information on the topic:

I am using Spark on YARN with the Java API. To enable the fair mode, the code is:

SparkConf conf = new SparkConf();
conf.set("spark.scheduler.mode", "FAIR");
conf.setMaster("yarn-client").setAppName("MySparkApp");
JavaSparkContext sc = new JavaSparkContext(conf);

Did I miss something?


1 Answer


It appears that you didn't set up any pools, so all your jobs end up in a single default pool, as described in Configuring Pool Properties:

Specific pools’ properties can also be modified through a configuration file.

and later

A full example is also available in conf/fairscheduler.xml.template. Note that any pools not configured in the XML file will simply get default values for all settings (scheduling mode FIFO, weight 1, and minShare 0).
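
For illustration, an allocation file along the lines of conf/fairscheduler.xml.template defines one pool element per pool (the production/test pool names below follow the template; the file location is up to you and is passed via spark.scheduler.allocation.file):

<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>

and in the application code:

conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml");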

It may also be that you didn't set the local property that assigns a pool to a given job (or jobs), as described in Fair Scheduler Pools:

Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool “local property” to the SparkContext in the thread that’s submitting them.
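
As a rough sketch of what that could look like with the question's setup (the pool names pool_a and pool_b are made up for illustration; pools referenced but not configured in the XML simply get the default settings quoted above), the local property is per-thread, so each submitting thread picks its own pool:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setMaster("yarn-client")
        .setAppName("MySparkApp")
        .set("spark.scheduler.mode", "FAIR");
JavaSparkContext sc = new JavaSparkContext(conf);

// Jobs submitted from different threads can land in different pools
// and then share the cluster according to the pools' weights.
Runnable jobA = () -> {
    sc.setLocalProperty("spark.scheduler.pool", "pool_a"); // placeholder pool name
    sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
};
Runnable jobB = () -> {
    sc.setLocalProperty("spark.scheduler.pool", "pool_b"); // placeholder pool name
    sc.parallelize(Arrays.asList(5, 6, 7, 8)).count();
};
new Thread(jobA).start();
new Thread(jobB).start();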

Finally, it may simply mean that you are using a single default FIFO pool, and one pool in FIFO mode changes nothing compared to FIFO scheduling without pools.

Only you can know the real answer :)

Jacek Laskowski
  • In the default pool, jobs run in parallel if they are submitted from different threads; I have seen them running in parallel. I don't think we need to create pools just to parallelize jobs: "each pool gets an equal share of the cluster (also equal in share to each job in the default pool)" from http://spark.apache.org/docs/latest/job-scheduling.html#default-behavior-of-pools – spats Jan 11 '18 at 00:15
  • That's correct if # CPUs > # tasks from unrelated stages. – Jacek Laskowski Jan 11 '18 at 11:30
  • Guys, I'm confused here. I am actually trying the FAIR scheduler from spark-shell. I did not configure any pools at all, as I think I understood the same thing as spats. So can someone observe fair scheduling at work without setting any pools, relying only on the default pool? @Jacek Laskowski what do you mean by "# CPUs > # tasks from unrelated stages", and where does that come from? – MaatDeamon Aug 29 '18 at 22:21
  • @MaatDeamon You can pass --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/path/to/fair.xml to your spark-shell. In that file, specify schedulingMode FAIR for the pool "default". You will then see on the Stages UI tab that you have a default pool with FAIR scheduling. However, and this is important, in the absence of other pools you will have the same effect as FIFO mode unless your shell spawns jobs from different threads. See my detailed answer here: https://stackoverflow.com/a/54488965/2505920 – kasur Feb 02 '19 at 01:05
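
A minimal fair.xml for the spark-shell setup described in the last comment could, as far as I understand, simply redefine the pool named "default" to flip its scheduling mode (the file path is whatever you pass via spark.scheduler.allocation.file):

<?xml version="1.0"?>
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>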