I'm trying to use all resources on my EMR cluster.
The cluster consists of 4 m4.4xlarge machines (1 driver and 3 workers), each with 16 vCores, 64 GiB of memory, and 128 GiB of EBS storage.
When launching the cluster through the CLI I'm presented with the following options (all 3 options were run within the same data pipeline):
Just use "maximizeResourceAllocation" without any other spark-submit parameters.
This only gives me 2 executors, shown here:
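For context, maximizeResourceAllocation isn't a spark-submit flag but a cluster-level setting passed through the "spark" configuration classification when the cluster is created. A minimal sketch of how it gets set (the boto3 call here is purely illustrative; the name, release label and roles are placeholders, not my actual launch command):

```python
# Minimal sketch: enabling maximizeResourceAllocation through the "spark"
# classification at cluster creation. Everything except the classification
# itself (name, release label, roles) is a placeholder.
import boto3

emr = boto3.client("emr")

configurations = [
    {
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }
]

response = emr.run_job_flow(
    Name="example-cluster",                   # placeholder
    ReleaseLabel="emr-6.9.0",                 # placeholder release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.4xlarge",
        "SlaveInstanceType": "m4.4xlarge",
        "InstanceCount": 4,                   # 1 driver + 3 workers
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=configurations,
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

The same Configurations structure can be written to a JSON file and passed on the command line with `aws emr create-cluster ... --configurations file://spark-config.json`.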
Don't pass anything and leave the spark-defaults to do their job.
This gives the following undersized executors:
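To see what those defaults actually resolve to at runtime, I print the effective configuration from inside the job. This is plain PySpark, nothing EMR-specific, and it only reads the settings:

```python
# Print the executor-related settings the job actually ended up with,
# whatever spark-defaults (or maximizeResourceAllocation) decided.
# Keys that were never set explicitly simply won't appear in getAll().
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

interesting = {
    "spark.executor.instances",
    "spark.executor.cores",
    "spark.executor.memory",
    "spark.executor.memoryOverhead",
    "spark.dynamicAllocation.enabled",
    "spark.default.parallelism",
}

for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key in interesting:
        print(f"{key} = {value}")
```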
Use AWS's guide on how to configure a cluster in EMR.
Following this guide, I deduced the following spark-submit parameters (the arithmetic I used is sketched after the list):
"--conf",
"spark.executor.cores=5",
"--conf",
"spark.executor.memory=18g",
"--conf",
"spark.executor.memoryOverhead=3g",
"--conf",
"spark.executor.instances=9",
"--conf",
"spark.driver.cores=5",
"--conf",
"spark.driver.memory=18g",
"--conf",
"spark.default.parallelism=45",
"--conf",
"spark.sql.shuffle.partitions=45",
Now, I did look everywhere I could on the internet, but I couldn't find any explanation of why EMR doesn't use all the resources provided. Maybe I'm missing something, or maybe this is expected behaviour, but when "maximizeResourceAllocation" only spawns 2 executors on a cluster with 3 workers, something seems wrong.
UPDATE:
So today, while running a different data pipeline, I got this using "maximizeResourceAllocation":
This is much, much better than the other runs, but it still falls well short in terms of memory used and number of executors (although someone from the EMR team said that EMR merges executors into super-executors to improve performance).