I am confused about the right criteria for setting the following spark-submit parameters. For example:
spark-submit --deploy-mode cluster --name "CoreLogic Transactions Curated ${var_date}" \
--driver-memory 4G --executor-memory 4G --num-executors 10 --executor-cores 4 \
/etl/scripts/corelogic/transactions/corelogic_transactions_curated.py \
--from_date ${var_date} \
--to_date ${var_to_date}
One person told me that I am using too many executors and cores, but he did not explain why.
Can someone explain the right criteria for setting these parameters (--driver-memory 4G --executor-memory 4G --num-executors 10 --executor-cores 4) according to my dataset?
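For context, here is my rough accounting of what this request asks the cluster for (I am assuming the default executor memory overhead of max(384 MB, 10% of executor memory), which I have not set explicitly):

total cores  = num-executors × executor-cores = 10 × 4 = 40 cores
total memory = num-executors × (executor-memory + overhead) ≈ 10 × (4 GB + 0.4 GB) ≈ 44 GB
driver       = 4 GB + overhead (runs on the cluster in cluster deploy mode)

Is this the kind of accounting I should be checking against my cluster capacity and my dataset size, or is there more to it?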
The same question applies in the following case:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('DemoEcon PEP hist stage') \
    .config('spark.sql.shuffle.partitions', args.shuffle_partitions) \
    .enableHiveSupport() \
    .getOrCreate()
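For completeness, args comes from an argparse parser along these lines (the flag name matches my snippet above; the default of 200 is just Spark's own default, not a value I chose deliberately):

import argparse

parser = argparse.ArgumentParser()
# I pass the value in from the orchestration layer; 200 is Spark's default.
parser.add_argument('--shuffle_partitions', type=int, default=200)
args = parser.parse_args()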
I am not sure what criteria to use when setting the "spark.sql.shuffle.partitions" parameter.
Can someone help me get this clear in my mind?
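The only heuristic I have come across is to aim for roughly 128 MB of shuffled data per partition, e.g. (the 50 GB figure is a hypothetical input size, not my actual data):

# Heuristic I have seen suggested; I am not sure it is the right mental model.
shuffle_input_gb = 50        # hypothetical size of the data being shuffled
target_partition_mb = 128    # commonly suggested target partition size
num_partitions = int(shuffle_input_gb * 1024 / target_partition_mb)
print(num_partitions)        # 400

Is this a reasonable way to pick the value?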
Thank you in advance