
I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vCores available in the cluster, as you can see below.

[Screenshot: cluster usage]

Based on some other questions like this and this, I have set up the cluster to use the DominantResourceCalculator so that both vCPUs and memory are considered for resource allocation:

gcloud dataproc clusters create cluster_name --bucket="profiling-
job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

But when I submit my job with custom Spark flags, it looks like YARN doesn't respect these custom parameters and defaults to using memory as the yardstick for resource calculation:

gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.sql.broadcastTimeout=900,spark.network.timeout=800\
,yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator\
,spark.dynamicAllocation.enabled=true\
,spark.executor.instances=10\
,spark.executor.cores=14\
,spark.executor.memory=15g\
,spark.driver.memory=50g \
src/my_python_file.py 

Can somebody help figure out what's going on here?

borarak

2 Answers


What I did wrong was adding the configuration yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator under the yarn: prefix (i.e. to yarn-site.xml) instead of under the capacity-scheduler: prefix (i.e. to capacity-scheduler.xml, where it rightly belongs) during cluster creation.

Secondly, I changed yarn:yarn.scheduler.minimum-allocation-vcores, which was initially set to 1, to 4.

I'm not sure if one of these changes or both of them led to the solution (I will update soon). My new cluster creation command looks like this:

gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
--zone=europe-west1-c \
--master-boot-disk-size=500GB \
--worker-boot-disk-size=500GB \
--master-machine-type=n1-standard-16 \
--num-workers=10 \
--worker-machine-type=n1-standard-16 \
--initialization-actions gs://custom_init_gcp.sh \
--metadata MINICONDA_VARIANT=2 \
--properties=^--^yarn:yarn.scheduler.minimum-allocation-vcores=4--capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
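
To confirm where the property actually landed, you can grep the rendered config on the master node. This is a minimal sketch, assuming the default Dataproc config path /etc/hadoop/conf and the standard -m master-node suffix:

gcloud compute ssh cluster_name-m --zone=europe-west1-c \
  --command="grep -A1 'resource-calculator' /etc/hadoop/conf/capacity-scheduler.xml"

If the change took effect, the <value> element should show DominantResourceCalculator.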
borarak
  • I tried it. You don't need yarn:yarn.scheduler.minimum-allocation-vcores=4. It's the capacity-scheduler that does the trick! – Harikrishnan Balachandran Aug 26 '22 at 13:46
  • I don't think this solution works. I am running Spark 3.3 on YARN on Dataproc, and switching to the dominant resource calculator doesn't change the cores per executor that the YARN GUI reports. – figs_and_nuts Jun 01 '23 at 10:24

First, as you have dynamic allocation enabled, you should set the properties spark.dynamicAllocation.maxExecutors and spark.dynamicAllocation.minExecutors (see https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation).
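
For example, a submission with those bounds set might look like the sketch below (the min/max values are illustrative placeholders, not recommendations):

gcloud dataproc jobs submit pyspark --cluster cluster_name \
--properties spark.dynamicAllocation.enabled=true\
,spark.dynamicAllocation.minExecutors=10\
,spark.dynamicAllocation.maxExecutors=40 \
src/my_python_file.py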

Second, make sure you have enough partitions in your Spark job. As you are using dynamic allocation, YARN only allocates just enough executors to match the number of tasks (partitions). So check the Spark UI to see whether your jobs (more specifically, their stages) have more tasks than you have vCores available.

Raphael Roth
  • Thank you for the answer. The link says that `spark.dynamicAllocation.minExecutors` is **relevant**, but it doesn't say how it impacts resource allocation. Can you comment more on this, please? Secondly, my job had more than enough partitions (~3K); increasing them only slowed things down, and decreasing them led to memory issues. I posted an answer which worked for me. – borarak Jun 14 '17 at 14:22