
I am using Dataproc to run my PySpark jobs. These are the three ways I can submit them (rough sketches of each below):

  1. The Dataproc submit command (`gcloud dataproc jobs submit pyspark`)
  2. The `spark-submit` utility provided by Spark
  3. For small experiments I can also use `spark-shell`
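
Roughly, the three invocations look like this (the cluster name, region, and file names are just placeholders):

```sh
# 1. Dataproc submit
gcloud dataproc jobs submit pyspark my_job.py --cluster=my-cluster --region=us-central1

# 2. spark-submit run directly on the master node
spark-submit my_job.py

# 3. Interactive shell for quick experiments
pyspark        # or spark-shell for the Scala shell
```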

Now I have to modify a few env variables, for instance SPARK_HOME.

For dataproc submit I have options to set env variables separately for the driver and the workers. Specifically, I can use spark.executorEnv.[NAME] to set env variables on the workers and spark.yarn.appMasterEnv.[NAME] to set env variables for the driver (in cluster mode).
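
For example, something along these lines (cluster name, region, and the install path are placeholders):

```sh
gcloud dataproc jobs submit pyspark my_job.py \
  --cluster=my-cluster \
  --region=us-central1 \
  --properties=spark.yarn.appMasterEnv.SPARK_HOME=/opt/my/spark,spark.executorEnv.SPARK_HOME=/opt/my/spark
```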

For the spark-submit utility and spark-shell, I can submit the Spark job after exporting the env variables on the master only, e.g. `export SPARK_HOME='path'`, and then it works fine.
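
Concretely (the path is a placeholder):

```sh
# Exported only in my shell on the master node; nothing is passed for the workers
export SPARK_HOME=/opt/my/spark
spark-submit my_job.py
```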

I want to understand what Spark is doing under the hood with env variables. Do the workers inherit the same env variables as the master, which are overridden only if they are explicitly set on the worker nodes? Why would we need different env variables for the driver and the workers?
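
For context, here is a minimal sketch of how I could compare the two environments from inside a PySpark job (nothing Dataproc-specific, and the variable name is just an example):

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()
sc = spark.sparkContext

# Environment of the driver process
print("driver SPARK_HOME:", os.environ.get("SPARK_HOME"))

# Environment of the executor processes, collected back to the driver
def read_env(_partition):
    import os
    yield os.environ.get("SPARK_HOME")

executor_values = sc.parallelize(range(4), 4).mapPartitions(read_env).collect()
print("executor SPARK_HOME values:", executor_values)
```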

  • Env variables are process-level. I don't think Spark has any mechanism for env variable propagation. – Dagang Jan 07 '22 at 03:54
  • But then why am I able to run the Spark code successfully by just exporting the env variable on the driver node? Shouldn't I have to explicitly set the env variables on the worker nodes as well? https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html here too the suggested way is to just set them on the driver, and the workers work just fine – figs_and_nuts Jan 07 '22 at 04:00
  • I checked the article you linked, but I don't see where it says "set them (env variables) on the driver and the workers work just fine". If you are talking about `PYSPARK_PYTHON`, PySpark might have handled it specifically; it is not a general mechanism for all env variables. – Dagang Jan 10 '22 at 02:56
  • Specific to that article, I was talking about ```PYSPARK_PYTHON``` only. As I mentioned in the question, I can also set ```SPARK_HOME``` only on the driver and run my Spark jobs successfully. In fact, I am setting these 4 env variables only on the driver to run PySpark 3.2.0 on Dataproc: ```SPARK_HOME```, ```HADOOP_CONF_DIR```, ```SPARK_CONF``` and ```PYSPARK_PYTHON```. It does not run if I don't set these – figs_and_nuts Jan 10 '22 at 04:49
  • For running a Dataproc job, the environment variables can be passed with the gcloud dataproc jobs submit command. Could you clarify why you want to set the environment variables on the worker nodes? – Shipra Sarkar Jan 10 '22 at 11:28
  • @ShipraSarkar How can dataproc submit be used to pass the env variables in client deploy mode? I have asked a similar question, https://stackoverflow.com/questions/70612406/how-to-pass-env-variables-in-dataproc-submit-command/70614755 .. it would be great if you could answer that. Dataproc is not up to date with PySpark 3.2.0, and I want to use the pandas API on Spark that shipped with PySpark 3.2. So I am creating a cluster with an env that has PySpark 3.2 installed and changing the env variables so that the Spark from within the env gets used rather than the global Spark 3.1 version – figs_and_nuts Jan 10 '22 at 14:53

0 Answers