
Question

What is the motivation for creating multiple Spark applications/sessions instead of sharing a single global session?

Explanation

You have a Spark Standalone cluster manager.

Cluster:

  • 5 machines
  • 2 cores (executors) each = 10 executors in total
  • 16 GB RAM per machine

Jobs:

  • Dump the database: requires all 10 executors, but only 1 GB RAM per executor.
  • Handle the dump results: requires 5 executors with 8-16 GB RAM each.
  • Quick data retrieval task: 5 executors with 1 GB RAM each.
  • etc.

Which solution is the best practice? Why should I ever prefer the 1st solution over the 2nd, or the 2nd over the 1st, if the cluster resources stay the same?

Solutions:

  1. Launch the 1st, 2nd and 3rd jobs from different Spark applications (JVMs) — see the sketch after this list.
  2. Use a single global Spark application/session that holds all resources of the cluster (10 executors, 8 GB RAM each). Create a fair scheduler pool for the 1st, 2nd and 3rd jobs.
  3. Use some hacks like this to run jobs with different configs from a single JVM. But I'm afraid that's not a very stable (or officially supported by the Spark team) solution.
  4. [Spark Job Server][5], but as I understand it, that's an implementation of the first solution.
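To make the 1st solution concrete, here's roughly what I have in mind (the master URL, app names and sizes are just placeholders, not a finished implementation): each job is its own application with its own executor sizing.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical "dump database" job, launched as its own application (own JVM):
// all 10 cores, but only 1 GB per executor.
object DumpJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("db-dump")
      .master("spark://master:7077")           // Standalone master (placeholder URL)
      .config("spark.executor.memory", "1g")   // small executors for the dump
      .config("spark.cores.max", "10")         // claim every core in the cluster
      .getOrCreate()

    // ... run the dump here ...

    spark.stop()  // release the executors so the next application can use them
  }
}

// The "handle dump results" job would be a second application with its own
// sizing, e.g. spark.executor.memory=8g and spark.cores.max=5.
```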

Update

It looks like the 2nd option (a global session with all resources + a fair scheduler pool) isn't possible, because in the fair scheduler allocation file (fairscheduler.xml) you can only configure the number of cores per pool (minShare), not the memory per executor.
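For reference, this is roughly how I'd try to wire up the 2nd solution from a single session (pool names and the allocation file path are placeholders); as far as I can tell the pool only influences scheduling, not executor sizing:

```scala
import org.apache.spark.sql.SparkSession

object SharedSessionSketch {
  def main(args: Array[String]): Unit = {
    // Single shared session for the whole cluster. A pool entry in the
    // allocation file can set only schedulingMode, weight and minShare
    // (measured in CPU cores) -- there is no per-pool executor memory.
    val spark = SparkSession.builder()
      .appName("shared-session")
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()

    // Jobs submitted from different threads can target different pools;
    // the pool name is read from a thread-local property.
    val dump = new Thread(() => {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "dump")
      spark.range(0L, 1000000L).count()   // placeholder for the dump job
    })
    val quick = new Thread(() => {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "quick")
      spark.range(0L, 1000L).count()      // placeholder for the quick task
    })
    dump.start(); quick.start()
    dump.join(); quick.join()
    spark.stop()
  }
}
```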

