Question
What is the motivation for creating multiple Spark applications/sessions instead of sharing a single global session?
Explanation
Suppose you have the Spark Standalone cluster manager.
Cluster:
- 5 machines
- 2 cores (executors) each = 10 executors in total
- 16 GB RAM each machine
Jobs:
- Dump a database: requires all 10 executors, but only 1 GB RAM per executor.
- Process the dump results: requires 5 executors with 8-16 GB RAM each.
- Quick data-retrieval task: 5 executors with 1 GB RAM each.
- etc
Which solution is the best practice? Why should I ever prefer the 1st solution over the 2nd, or the 2nd over the 1st, if the cluster's resources stay the same?
Solutions:
- Launch the 1st, 2nd and 3rd jobs from different Spark applications (JVMs), each with its own resource configuration (see the first sketch after this list).
- Use a single global Spark application/session that holds all the resources of the cluster (10 executors, 8 GB RAM each). Create a fair scheduler pool for the 1st, 2nd and 3rd jobs (see the second sketch after this list).
- Use some hacks like this to run jobs with different configs from a single JVM. But I'm afraid that's not a very stable (or, if you like, officially supported by the Spark team) solution.
- [Spark Job Server][5], but as I understand it, it's an implementation of the first solution.
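To picture the 1st solution, here is a minimal sketch (the master URL, app names and exact config values are my assumptions, not anything prescribed by Spark): each job is its own application/JVM with its own SparkSession, so each one can size its executors independently on the standalone cluster.

```scala
import org.apache.spark.sql.SparkSession

// Own application/JVM: the dump job wants every core, but only 1 GB per executor.
object DumpDatabaseApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dump-database")
      .master("spark://master:7077")          // assumed standalone master URL
      .config("spark.executor.memory", "1g")  // small executors
      .config("spark.executor.cores", "1")    // 1 core per executor -> 2 executors per 2-core machine
      .config("spark.cores.max", "10")        // take all 10 cores of the cluster
      .getOrCreate()

    // ... dump logic goes here ...
    spark.stop()
  }
}

// A separate application/JVM with a different resource profile:
// 5 fat executors with 8 GB each for processing the dump.
object HandleDumpApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("handle-dump")
      .master("spark://master:7077")
      .config("spark.executor.memory", "8g")
      .config("spark.executor.cores", "1")
      .config("spark.cores.max", "5")
      .getOrCreate()

    // ... processing logic goes here ...
    spark.stop()
  }
}
```

In standalone mode `spark.cores.max` caps the total cores an application takes, so the dump application can grab all 10 cores with tiny executors while the second application asks for fewer, fatter ones.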
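And a sketch of the 2nd solution under the same assumptions (the pool names `dump`/`quick` and the allocation-file path are made up): one application owns all executors, sized identically, and concurrent jobs submitted from different threads are separated only by fair scheduler pools.

```scala
import org.apache.spark.sql.SparkSession

// Single shared application: all executors have the same size; fair scheduler
// pools divide cores/task slots between concurrent jobs, not executor memory.
object SharedSessionApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shared-session")
      .master("spark://master:7077")                 // assumed standalone master URL
      .config("spark.executor.memory", "8g")         // one executor size for every job
      .config("spark.cores.max", "10")
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml") // assumed path
      .getOrCreate()
    val sc = spark.sparkContext

    // Each job runs in its own thread and tags that thread with a pool name.
    def runInPool(pool: String)(job: => Unit): Thread = {
      val t = new Thread(() => {
        sc.setLocalProperty("spark.scheduler.pool", pool)
        job
      })
      t.start()
      t
    }

    val dump  = runInPool("dump")  { sc.range(0L, 1000000L).count() }  // stand-in workload
    val quick = runInPool("quick") { sc.parallelize(1 to 100).sum() }  // stand-in workload

    dump.join(); quick.join()
    spark.stop()
  }
}
```

Note that `setLocalProperty` applies per thread, which is why every job needs its own thread; the pools only influence how task slots (cores) are shared.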
Update
Looks like the 2nd option (a global session with all resources + fair scheduler pools) isn't possible, because you can only configure the number of cores in the pool XML file (minShare), but not memory per executor.
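To make that limitation concrete, here is a sketch of what the fair scheduler allocation file can express (the pool names and file path are the same assumptions as in the snippet above, and the file is written from Scala here only to keep the example self-contained): per pool you get schedulingMode, weight and minShare, where minShare is counted in cores, and there is no per-executor memory property.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical allocation file for the "dump" and "quick" pools used above.
// The fair scheduler only understands schedulingMode, weight and minShare
// (a number of cores) per pool -- nothing about memory per executor.
object WriteFairSchedulerConfig {
  def main(args: Array[String]): Unit = {
    val xml =
      """<?xml version="1.0"?>
        |<allocations>
        |  <pool name="dump">
        |    <schedulingMode>FAIR</schedulingMode>
        |    <weight>1</weight>
        |    <minShare>10</minShare>   <!-- cores, not GB of RAM -->
        |  </pool>
        |  <pool name="quick">
        |    <schedulingMode>FIFO</schedulingMode>
        |    <weight>1</weight>
        |    <minShare>5</minShare>
        |  </pool>
        |</allocations>
        |""".stripMargin

    // Point spark.scheduler.allocation.file at this path (assumed location).
    Files.write(Paths.get("/etc/spark/fairscheduler.xml"), xml.getBytes("UTF-8"))
  }
}
```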