
As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is to change the corresponding log4j.properties. However, on Dataproc Spark runs on YARN, so we have to adjust the global configuration and not /usr/lib/spark/conf

Several suggestions:

On Dataproc we have several gcloud commands and properties we can pass during cluster creation (see the documentation). Is it possible to change the log4j.properties under /etc/hadoop/conf by specifying

--properties 'log4j:hadoop.root.logger=WARN,console'

Maybe not, as from the docs:

The --properties command cannot modify configuration files not shown above.

Another way would be to use a shell script during cluster init and run sed:

# change log level for each node to WARN
sudo sed -i -- 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g'\
                     /etc/spark/conf/log4j.properties
sudo sed -i -- 's/hadoop.root.logger=INFO,console/hadoop.root.logger=WARN,console/g'\
                    /etc/hadoop/conf/log4j.properties
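
For reference, the script above could be attached as an init action roughly like this (the bucket and script names are placeholders):

# stage the script in GCS
gsutil cp set-log-level.sh gs://my-bucket/set-log-level.sh

# run it on every node at cluster creation time
gcloud dataproc clusters create my-cluster \
  --initialization-actions gs://my-bucket/set-log-level.sh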

But is that enough, or do we need to change the env variable hadoop.root.logger as well?

Frank
  • The 2nd way actually works for me, but I still wonder if there's a better way without editing the config files which might change over time and releases. – Frank Mar 23 '16 at 11:46

3 Answers


This answer is outdated as of Q3 2022; check the answer below for the latest info.

At the moment, you're right that --properties doesn't support extra log4j settings, but it's certainly something we've talked about adding; some considerations include how to balance fine-grained control over the logging configs of Spark vs. YARN vs. other long-running daemons (hiveserver2, HDFS daemons, etc.) against keeping a minimal/simple setting that is plumbed through to everything in a shared way.

At least for Spark driver logs, you can use the --driver-log-levels setting at job-submission time, which should take precedence over any of the /etc/*/conf settings. Otherwise, as you describe, init actions are a reasonable way to edit the files on cluster startup for now, keeping in mind that they may change over time and across releases.
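
For example, a rough sketch at job-submission time (the cluster, bucket, and jar names here are placeholders):

gcloud dataproc jobs submit spark \
  --cluster my-cluster \
  --jar gs://my-bucket/my-spark-job.jar \
  --driver-log-levels root=WARN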

Dennis Huo
    Is there any way to use the `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` within `--properties` to define the `-Dlog4j.configuration` to use a `log4j.properties` file either located as a resource in my jar or, better yet, located in `gs://`? – Danny Varod Jan 14 '20 at 10:57

Recently, support for log4j properties has been added via the --properties flag. For example, you can now use "--properties 'hadoop-log4j:hadoop.root.logger=WARN,console'". See this page (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties) for more details.
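
As a sketch, a full cluster-creation command using that property might look like this (the cluster name is a placeholder; because the value itself contains a comma, this assumes gcloud's alternate-delimiter escaping described under gcloud topic escaping):

# '#' replaces ',' as the pair delimiter so the comma in "WARN,console" is preserved
gcloud dataproc clusters create my-cluster \
  --properties '^#^hadoop-log4j:hadoop.root.logger=WARN,console'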

Himanshu Kohli

Updated in Q3 2022

Default config

The default log4j config for Spark on Dataproc is available at /etc/spark/conf/log4j.properties. It configures the root logger to log to stderr at INFO level. But at runtime, driver logs (in client mode) are directed by the Dataproc agent to GCS and streamed back to the client, while executor logs (and driver logs in cluster mode) are redirected by YARN to the stderr file in the container's YARN log dir. See this answer for how to get YARN container logs on Dataproc.

Consider using /etc/spark/conf/log4j.properties as the template for your custom config, and keep using console as the target for your logs.
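
For example, one way to build such a custom config is to copy the default file from a cluster node, lower the root level, and stage it in GCS for later use with --files (the local file and bucket names are placeholders):

# start from the default config on a cluster node
cp /etc/spark/conf/log4j.properties my-log4j.properties

# lower the root level but keep console (stderr) as the target
sed -i 's/^log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/' my-log4j.properties

# stage it in GCS so it can be passed to jobs via --files (see "Job level" below)
gsutil cp my-log4j.properties gs://my-bucket/log4j.properties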

Cluster level

If you want to configure Spark driver and executor logs at cluster level, the simplest way is to add --properties spark-log4j:<key>=<value>,... when creating the cluster. The properties from the flag will be appended to /etc/spark/conf/log4j.properties, which will be used as the default log4j config for all Spark drivers and executors in the cluster. Or you can write an init action to update the file.
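
For example, to quiet Spark's own INFO logging cluster-wide (the cluster name is a placeholder, and the specific logger key is only an illustration):

gcloud dataproc clusters create my-cluster \
  --properties 'spark-log4j:log4j.logger.org.apache.spark=WARN'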

Job level

You can also configure Spark driver and/or executor logs at job level when submitting the job, in one of the following ways:

  1. --driver-log-levels (for driver only), for example:
gcloud dataproc jobs submit spark ...\
  --driver-log-levels root=WARN,org.apache.spark=DEBUG
  2. --files. If the driver and executor can share the same log4j config, then gcloud dataproc jobs submit spark ... --files gs://my-bucket/log4j.properties will be the easiest. Note that the file name should be exactly log4j.properties, so it can override the default one.

  3. --files and --properties spark.[driver|executor].extraJavaOptions=-Dlog4j.configuration= (for both driver and executor). Note that -Dlog4j.configuration should be set to file:<filename> because the files will be present in the working directory of the YARN container for driver/executor.

gcloud dataproc jobs submit spark ... \
  --files gs://my-bucket/driver-log4j.properties,gs://my-bucket/executor-log4j.properties \
  --properties 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:driver-log4j.properties,spark.executor.extraJavaOptions=-Dlog4j.configuration=file:executor-log4j.properties'

See also https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application

Dagang