I am using Dataproc to run my PySpark jobs. These are the three ways I can submit a job:

1. the `dataproc submit` command
2. the `spark-submit` utility provided by Spark
3. `spark-shell`, which I can also use for small experimentation
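
For context, the three invocations look roughly like this (the cluster name, region, and script path are placeholders from my setup):

```bash
# 1. Submit through Dataproc (cluster/region/script are placeholders)
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster --region=us-central1

# 2. Plain spark-submit, run on the master node
spark-submit my_job.py

# 3. Interactive shell for small experiments
pyspark        # or spark-shell for the Scala shell
```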
Now I need to modify a few environment variables, for instance `SPARK_HOME`.
With `dataproc submit` I have options to set environment variables separately for the driver and the workers: I can use `spark.executorEnv.[NAME]` to set an env variable on the workers and `spark.yarn.appMasterEnv.[NAME]` to set it on the driver.
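
For example, something along these lines (the property names are from the Spark-on-YARN docs; `/custom/spark` is just an illustrative value):

```bash
# Via Dataproc job properties
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster --region=us-central1 \
    --properties='spark.executorEnv.SPARK_HOME=/custom/spark,spark.yarn.appMasterEnv.SPARK_HOME=/custom/spark'

# Equivalent with spark-submit on the master
spark-submit \
    --conf spark.executorEnv.SPARK_HOME=/custom/spark \
    --conf spark.yarn.appMasterEnv.SPARK_HOME=/custom/spark \
    my_job.py
```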
For the `spark-submit` utility and `spark-shell`, I can simply export the environment variable on the master before submitting the job, e.g. `export SPARK_HOME='path'`, and it works fine.
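
Concretely, what I do on the master node is roughly (the path is just an example):

```bash
# On the Dataproc master node only
export SPARK_HOME='/usr/lib/spark'
spark-submit my_job.py      # or start spark-shell / pyspark afterwards
```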
I want to understand what Spark does under the hood with environment variables. Do the workers start with the same env variables as the master, which are only overridden when I explicitly set them on the worker nodes? And why would the driver and the workers need different env variables at all?
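
For reference, this is roughly how I have been checking what each side sees (a throwaway test; the cluster name and region are placeholders):

```bash
# Throwaway PySpark script that prints SPARK_HOME as seen by the driver
# and by the executors (written to /tmp just for this test).
cat > /tmp/env_check.py <<'EOF'
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Value in the driver process
print("driver SPARK_HOME:", os.environ.get("SPARK_HOME"))

# Value inside an executor task
print("executor SPARK_HOME:",
      sc.parallelize([0], 1).map(lambda _: os.environ.get("SPARK_HOME")).collect())
EOF

gcloud dataproc jobs submit pyspark /tmp/env_check.py \
    --cluster=my-cluster --region=us-central1
```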