
I want to run multiple Spark SQL queries in parallel on a Spark cluster, so that I can make use of the complete cluster resources. I'm using sqlContext.sql(query).

EDIT:

I saw some sample code like the following:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
val tasks: Seq[String] = ??? // the SQL queries to run
val results: Seq[Future[Int]] = tasks.map(query => {
  Future {
    // spark stuff here, e.g. sqlContext.sql(query)
    0
  }(ec)
})
val allDone: Future[Seq[Int]] = Future.sequence(results)
// wait for all results
Await.result(allDone, Duration.Inf)
executor.shutdown() // otherwise the JVM will probably not exit

As I understand it, the ExecutionContext figures out the number of available cores on the machine (using a ForkJoinPool) and parallelizes accordingly. But what happens when we consider the whole Spark cluster rather than a single machine, and how can this guarantee full utilization of the cluster's resources?

E.g., if I have a 10-node cluster with 4 cores per node, how does the above code guarantee that all 40 cores will be utilized?
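
To make this concrete, here is a minimal sketch of how I picture the two levels fitting together (the spark-submit flags, table names, and FAIR-scheduler setting below are my own assumptions, not taken from the sample above). The driver-side thread pool only controls how many queries are submitted concurrently; the executors requested from the cluster (e.g. 10 executors with 4 cores each) determine how many of the 40 cores the resulting jobs can actually occupy.

// Submitted with something like (my assumption about the deployment):
//   spark-submit --master yarn --num-executors 10 --executor-cores 4 ...
// so the cluster offers 40 cores in total.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object ParallelQueries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallel-sql")
      // FAIR scheduling lets concurrently submitted jobs share executor cores
      // instead of queuing strictly behind each other (FIFO is the default).
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()

    // Placeholder queries -- in my case these would be the real SQL strings.
    val queries = Seq(
      "SELECT count(*) FROM table_a",
      "SELECT count(*) FROM table_b"
    )

    // One driver thread per concurrently running query; these threads only
    // submit jobs -- the actual work still runs on the executors' cores.
    val pool = Executors.newFixedThreadPool(queries.size)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    val futures = queries.map { q =>
      Future {
        spark.sql(q).count() // each call turns into one or more Spark jobs
      }
    }

    val counts = Await.result(Future.sequence(futures), Duration.Inf)
    counts.foreach(println)

    pool.shutdown()
    spark.stop()
  }
}

If I understand the docs correctly, with the default FIFO scheduler a large first job can occupy all executor cores while the others wait, whereas FAIR scheduling lets the concurrently submitted jobs share the executors more evenly. Is that the right way to think about it?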

Devas
  • Thanks for your help, but that question is unanswered. Also, sample code would be more helpful. – Devas May 02 '18 at 12:05
  • Here you are: [How can I parallelize different SparkSQL execution efficiently?](https://stackoverflow.com/a/50058194/9613318) – Alper t. Turker May 02 '18 at 12:07
  • Wow, that's very helpful. Thank you, will try the same. :) – Devas May 02 '18 at 12:10
  • If you search, you'll find a bunch of other similar questions, some probably with better answers. For example, there is [Processing multiple files as independent RDD's in parallel](https://stackoverflow.com/q/31912858/9613318). My search skills are just not top-notch today :) – Alper t. Turker May 02 '18 at 12:12

0 Answers