
I am so confused about the right criteria to use when it comes to setting the following spark-submit parameters, for example:

spark-submit --deploy-mode cluster --name "CoreLogic Transactions Curated ${var_date}" \
--driver-memory 4G --executor-memory 4G --num-executors 10 --executor-cores 4 \
/etl/scripts/corelogic/transactions/corelogic_transactions_curated.py \
--from_date ${var_date} \
--to_date ${var_to_date} 

One person told me that I am using too many executors and cores, but he did not explain why he said that.

Can someone explain the right criteria for setting these parameters (--driver-memory 4G --executor-memory 4G --num-executors 10 --executor-cores 4) according to my dataset?

The same question applies in the following case:

spark = SparkSession.builder \
    .appName('DemoEcon PEP hist stage') \
    .config('spark.sql.shuffle.partitions', args.shuffle_partitions) \
    .enableHiveSupport() \
    .getOrCreate()

I am not quite sure what criteria to use to set the "spark.sql.shuffle.partitions" parameter.

Can someone help me get this clear in my mind?

Thank you in advance

Eimis Pacheco
  • You can find several SO posts that discuss all this. Examples: [How to set Apache Spark Executor memory](https://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory), https://stackoverflow.com/questions/32349611/what-should-be-the-optimal-value-for-spark-sql-shuffle-partitions-or-how-do-we-i – blackbishop Jan 19 '21 at 20:30
  • Thank you @blackbishop, I will review. – Eimis Pacheco Jan 25 '21 at 15:43

1 Answer


This website has the answer I needed: an excellent explanation with some examples.

http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/

Here is one of those examples:

Case 1 Hardware: 6 nodes, and each node has 16 cores and 64 GB RAM. First, on each node, 1 core and 1 GB are needed for the operating system and Hadoop daemons, so we have 15 cores and 63 GB RAM available on each node.

We start with how to choose the number of cores:

Number of cores = Concurrent tasks an executor can run

So we might think that more concurrent tasks per executor will give better performance. But research shows that any application with more than 5 concurrent tasks per executor tends to perform poorly. So the optimal value is 5.

This number comes from an executor's ability to run parallel tasks, not from how many cores the machine has. So the number 5 stays the same even if we have double the cores (32) in the CPU.

Number of executors:

Coming to the next step: with 5 cores per executor and 15 total available cores per node (CPU), we get 3 executors per node (15/5). We calculate the number of executors on each node and then get the total number for the job.

So with 6 nodes and 3 executors per node, we get a total of 18 executors. Out of these 18, we need 1 executor (a Java process) for the Application Master in YARN, so the final number is 17 executors.

This 17 is the number we give to Spark using --num-executors when running the spark-submit shell command.
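
As a quick illustration, here is a minimal Python sketch of that executor-count arithmetic, assuming the Case 1 hardware above (the variable names are only illustrative, not part of any Spark API):

CORES_PER_NODE = 16
NODES = 6
RESERVED_CORES_PER_NODE = 1   # reserved for the OS and Hadoop daemons
TASKS_PER_EXECUTOR = 5        # the "5 concurrent tasks" rule of thumb

usable_cores_per_node = CORES_PER_NODE - RESERVED_CORES_PER_NODE   # 15
executors_per_node = usable_cores_per_node // TASKS_PER_EXECUTOR   # 3
total_executors = NODES * executors_per_node                       # 18
num_executors = total_executors - 1                                # 17 (1 goes to the YARN Application Master)

print(num_executors)  # 17 -> the value for --num-executors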

Memory for each executor:

From the step above, we have 3 executors per node, and the available RAM on each node is 63 GB.

So the memory for each executor on each node is 63/3 = 21 GB.

However, a small memory overhead is also needed to determine the full memory request to YARN for each executor.

The formula for that overhead is max(384 MB, 0.07 * spark.executor.memory).

Calculating that overhead: 0.07 * 21 GB (where 21 is the 63/3 from above) = 1.47 GB.

Since 1.47 GB > 384 MB, the overhead is 1.47 GB.

Subtracting that overhead from the 21 GB: 21 - 1.47 ≈ 19 GB.

So the executor memory is 19 GB.
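
The same memory calculation as a short Python sketch (again, the names are only illustrative):

MB_PER_GB = 1024

ram_per_node_gb = 63     # 64 GB minus 1 GB for the OS and Hadoop daemons
executors_per_node = 3

raw_memory_gb = ram_per_node_gb / executors_per_node                # 21 GB per executor
overhead_mb = max(384, 0.07 * raw_memory_gb * MB_PER_GB)            # max(384 MB, 1.47 GB) = 1.47 GB
executor_memory_gb = int(raw_memory_gb - overhead_mb / MB_PER_GB)   # ~19 GB

print(executor_memory_gb)  # 19 -> the value for --executor-memory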

Final numbers: executors = 17, cores per executor = 5, executor memory = 19 GB.
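
As a sketch of how those final numbers could be applied: they are the values for --num-executors, --executor-cores and --executor-memory on spark-submit, or equivalently (in a PySpark session built like the one in the question) the spark.executor.* configuration properties. This is only an illustration, not the one true setup:

from pyspark.sql import SparkSession

# Illustrative only: equivalent to
# --num-executors 17 --executor-cores 5 --executor-memory 19G
spark = SparkSession.builder \
    .appName('Resource allocation example') \
    .config('spark.executor.instances', '17') \
    .config('spark.executor.cores', '5') \
    .config('spark.executor.memory', '19g') \
    .getOrCreate()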

Eimis Pacheco