One Spark job runs one Oracle query, so I have to run multiple jobs in parallel so that all the queries fire at the same time.

How to run multiple jobs in parallel?

Nagendra Palla
  • You can submit multiple jobs through the same spark context if you make calls from different threads (actions are blocking). But the scheduling will have the final word on how "in parallel" those jobs run. – ernest_k Mar 30 '18 at 06:51
  • @NagendraPalla `spark-submit` is to submit a Spark application for execution (not jobs). – Jacek Laskowski Mar 30 '18 at 14:21
  • @ErnestKiwele actions **may** be blocking but `SparkContext` has two methods to submit a Spark job for execution, `runJob` and `submitJob`, and so it does **not** really matter whether an action is blocking or not but what SparkContext method to use to have non-blocking behaviour. – Jacek Laskowski Mar 30 '18 at 14:23
  • @JacekLaskowski Thanks, that's interesting. Could you perhaps add an example snippet to your answer? Would help to see how async/non-blocking jobs can be started with RDD action methods (or whatever API that's applicable)... – ernest_k Mar 30 '18 at 14:33
    @ErnestKiwele "RDD action methods" are already written and their implementations use whatever Spark devs bet on (mostly `SparkContext.runJob` as in [count](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1162)). You'd have to write your own RDD actions (on a custom RDD) to have required non-blocking feature in your Spark app. Ask another question here to **spark** my interest to demo it ;-) – Jacek Laskowski Mar 30 '18 at 14:36
  • @JacekLaskowski OK. I supposed your comment on 'blocking' was made in the context of standard RDD API (as my first one was). But thanks for the observation. – ernest_k Mar 30 '18 at 14:41
  • @ErnestKiwele Not that easy :) See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.AsyncRDDActions. I leave it to you to find the uses of the API (and I'd call them part of "standard RDD API" too) :) – Jacek Laskowski Mar 30 '18 at 14:43
  • see e.g. https://stackoverflow.com/a/48845764/1138523 – Raphael Roth Mar 31 '18 at 19:30

3 Answers


Quoting the official documentation on Job Scheduling:

Second, within each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads.

In other words, a single SparkContext instance can be used by multiple threads, which gives you the ability to submit multiple Spark jobs that may or may not run in parallel.

Whether the Spark jobs run in parallel depends on the number of CPUs (Spark does not track memory usage for scheduling). If there are enough CPUs to handle the tasks from multiple Spark jobs, the jobs will run concurrently.
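For illustration, here is a minimal sketch (not from the quoted documentation) that triggers two blocking actions from separate threads via Scala Futures, so two Spark jobs are submitted to the same SparkContext at the same time; the JDBC URL and table names are hypothetical placeholders:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallel-oracle-queries").getOrCreate()

// one blocking action (count) per thread = one Spark job per thread,
// all of them submitted concurrently to the same SparkContext
def runQuery(table: String): Long =
  spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE") // hypothetical URL; needs the Oracle JDBC driver on the classpath
    .option("dbtable", table)
    .load()
    .count()

val jobs = Seq("SCHEMA.TABLE_A", "SCHEMA.TABLE_B").map(table => Future(runQuery(table)))
val counts = Await.result(Future.sequence(jobs), Duration.Inf)
println(counts)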

If, however, the number of CPUs is not enough, you may consider using the FAIR scheduling mode (FIFO is the default):

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
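As a rough sketch of how the FAIR mode is usually switched on (spark.scheduler.mode and spark.scheduler.pool are standard Spark properties; the pool name below is a hypothetical placeholder):

import org.apache.spark.sql.SparkSession

// enable the FAIR scheduler for the whole application
val spark = SparkSession.builder()
  .appName("fair-scheduling-sketch")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// optionally route the jobs submitted from the current thread into a named pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "adhoc_queries")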


Just to clear things up a bit.

  1. spark-submit is to submit a Spark application for execution (not Spark jobs). A single Spark application has at least one Spark job.

  2. RDD actions may or may not be blocking. SparkContext comes with two methods to submit (or run) a Spark job, SparkContext.runJob and SparkContext.submitJob, so what really matters is not whether an action is blocking, but which SparkContext method you use to get non-blocking behaviour (see the sketch right after this list).
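Here is a minimal sketch of the non-blocking variant via SparkContext.submitJob, assuming a live SparkContext named sc; the RDD and the per-partition function are made up for illustration:

import java.util.concurrent.atomic.AtomicLong
import scala.concurrent.ExecutionContext

val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
val total = new AtomicLong(0L)

// submitJob returns a FutureAction immediately, whereas runJob blocks until the job finishes
val futureCount = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.size.toLong,          // work done on each partition
  rdd.partitions.indices,                         // run on every partition
  (_: Int, partCount: Long) => { total.addAndGet(partCount); () },
  total.get()                                     // final result, evaluated once the job completes
)

futureCount.foreach(n => println(s"counted $n rows"))(ExecutionContext.global)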

Please note that "RDD action methods" are already written and their implementations use whatever Spark developers bet on (mostly SparkContext.runJob as in count):

// RDD.count
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

You'd have to write your own RDD actions (on a custom RDD) to get the required non-blocking behaviour in your Spark app.
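That said, the comments above also point to the built-in AsyncRDDActions (countAsync, collectAsync, foreachAsync, ...), which wrap SparkContext.submitJob for the standard actions and return a FutureAction. A minimal sketch, again assuming a live SparkContext sc (very old Spark versions may need import org.apache.spark.SparkContext._ for the implicit conversion):

import scala.concurrent.ExecutionContext
import org.apache.spark.FutureAction

// countAsync returns right away; the Spark job runs in the background
val rdd = sc.parallelize(1 to 1000000)
val asyncCount: FutureAction[Long] = rdd.countAsync()
asyncCount.foreach(n => println(s"count = $n"))(ExecutionContext.global)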

Jacek Laskowski

In Scala, you can run multiple jobs in parallel using parallelized collections. You can parallelize a list using the method .par:

val myParallelList = List(1,2,3,4).par

When you use a map or a foreach on a parallelized list, the executions will be run in parallel.

So you could do something like the following:

List(df1, df2, df3, df4, df5)
  .zipWithIndex
  .par
  .foreach { case (df, index) =>
    // note the s interpolator, so the index is actually substituted into the path
    df.write.parquet(s"/tmp/dataframe$index")
  }

This would result in the 5 dataframes being written in parallel. If you look at the Spark UI, you should see the jobs running at the same time. You can adapt this snippet to run any group of jobs in parallel.

leleogere

With no further explanation, I assume that by a Spark job you mean a Spark action and all the tasks needed to evaluate that action. If that is the case, you can natively take a look at the official documentation (pay attention to your Spark version): https://spark.apache.org/docs/latest/job-scheduling.html

If that is not the solution you want to follow and you want the hackish, not-scalable, just-get-things-done way, schedule the jobs for the same time in cron using spark-submit - https://spark.apache.org/docs/latest/submitting-applications.html (see the sketch below).
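A hypothetical crontab sketch (the paths, class names and jars are placeholders) that fires two applications at the same minute:

# both applications start at 02:00, so their queries hit Oracle at (roughly) the same time
0 2 * * * /opt/spark/bin/spark-submit --master yarn --class com.example.QueryJobA /jobs/query-job-a.jar
0 2 * * * /opt/spark/bin/spark-submit --master yarn --class com.example.QueryJobB /jobs/query-job-b.jar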

And of course, if you are in a place where you have schedulers, then it is a simple case of looking into their documentation - Oozie, Airflow, Luigi, NiFi and even old-school GUIs like Control-M support this.

Thiago de Faria