Problem explanation
Suppose you have a Spark cluster with the Standalone cluster manager, where jobs are scheduled through a SparkSession created in a client app running on the JVM. For performance reasons, each job has to be launched with different configs (see the job types example below).
The problem is that you can't create two independent SparkSessions (backed by separate SparkContexts) in a single JVM.
So how do you launch multiple Spark jobs with different session configs simultaneously?
By different session configs I mean (see the builder sketch after this list):
- spark.executor.cores
- spark.executor.memory
- spark.kryoserializer.buffer.max
- spark.scheduler.pool
- etc.
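For context, here is a minimal sketch (assuming a Scala client; the master URL, app name and values are placeholders) of how such a session is typically created, and why a second builder call in the same JVM doesn't help:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical client-app setup; master URL, app name and values are placeholders.
val spark = SparkSession.builder()
  .master("spark://master-host:7077")            // Standalone cluster manager
  .appName("client-app")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "4g")
  .config("spark.kryoserializer.buffer.max", "512m")
  .getOrCreate()

// A second builder in the same JVM does not give an independent session:
// getOrCreate() reuses the already-running SparkContext, so executor-level
// settings passed here have no effect on the existing executors.
val second = SparkSession.builder()
  .config("spark.executor.memory", "16g")
  .getOrCreate()
```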
My thoughts
Possible ways to solve the problem:
- Set different session configs for each Spark job within the same SparkSession. Is it even possible? (See the sketch after this list.)
- Launch another JVM just to start another SparkSession, something I could call a "Spark session service". But you never know how many jobs with different configs you will need to launch simultaneously in the future. At the moment I only need 2-3 different configs at a time; that may be enough, but it isn't flexible.
- Make a global session with the same configs for all kinds of jobs. But this approach is the worst option from a performance perspective.
- Use Spark only for heavy jobs, and run all quick search tasks outside Spark. But that's a mess, since you need to keep another solution (like Hazelcast) running in parallel with Spark and split resources between them. Moreover, it brings extra complexity everywhere: deployment, support, etc.
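Regarding the first option: some things can vary per job within one SparkSession, but executor-level settings cannot. Below is a rough sketch, assuming the existing session is called `spark`, that FAIR scheduling is enabled (spark.scheduler.mode=FAIR), and that pools named "heavy" and "quick" are defined in a fairscheduler.xml; pool names and paths are made up:

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Assumes `spark` is the session created in the client app.
val heavySession = spark.newSession()   // shares the one SparkContext, separate SQL conf
val quickSession = spark.newSession()

// The scheduler pool is a thread-local property, so each job runs in its own thread.
val heavyJob = Future {
  heavySession.sparkContext.setLocalProperty("spark.scheduler.pool", "heavy")
  heavySession.read.parquet("/data/huge-dump").count()   // placeholder job
}
val quickJob = Future {
  quickSession.sparkContext.setLocalProperty("spark.scheduler.pool", "quick")
  quickSession.sql("SELECT 1").collect()                 // placeholder job
}

// Executor-level settings (spark.executor.cores, spark.executor.memory,
// spark.kryoserializer.buffer.max) belong to the single SparkContext and
// cannot differ between these sessions.
```

So newSession() isolates SQL configuration and temp views and lets each job go to a different scheduler pool, but it does not give different executor sizes.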
Job types example
- Dump-huge-database task. It's a CPU-light but IO-intensive, long-running task, so you may want to launch as many executors as you can, with low memory and few cores per executor.
- Heavy handle-dump-results task. It's CPU intensive, so you would launch one executor per cluster machine with the maximum number of cores. (See the launcher sketch after this list.)
- Quick data-retrieval task, which requires one executor per machine and minimal resources.
- Something in the middle between 1-2 and 3, where a job should take half of the cluster's resources.
- etc.
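For completeness, if each job type really needs its own executor profile (the second option in my thoughts above), one way is to submit each type as a separate application, i.e. a separate driver JVM, via SparkLauncher. A rough sketch, with made-up jar path, class names and resource numbers:

```scala
import org.apache.spark.launcher.SparkLauncher

// Each call spawns a separate spark-submit process, so each job type
// gets its own driver JVM and its own executor settings.
def submit(mainClass: String, executorCores: String, executorMemory: String) =
  new SparkLauncher()
    .setMaster("spark://master-host:7077")
    .setAppResource("/path/to/jobs.jar")
    .setMainClass(mainClass)
    .setConf("spark.executor.cores", executorCores)
    .setConf("spark.executor.memory", executorMemory)
    .startApplication()      // returns a SparkAppHandle for monitoring

// Job type 1: many small executors for the IO-bound dump.
val dumpHandle  = submit("jobs.DumpDatabase", executorCores = "1", executorMemory = "2g")
// Job type 2: one fat executor per machine for the CPU-bound processing.
val heavyHandle = submit("jobs.HandleDumpResults", executorCores = "8", executorMemory = "24g")
```

The trade-off is the one described above: per-job flexibility at the cost of one driver JVM per application and more moving parts to monitor.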