
I am on GCP and have been reading different posts about controlling PySpark logging, and I couldn't get anything to work until I found a post about overwriting the log4j.properties file. This is kind of working, except that containers are now failing with complaints about multiple SLF4J (log4j) bindings on the classpath, like this:

[2019-09-23 20:38:48.495]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 23850 Aborted                 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx38168m '-Dflogger.backend_factory=com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance' -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/davidsc/appcache/application_1569259407961_0001/container_1569259407961_0001_01_000400/tmp '-Dspark.driver.port=33949' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1569259407961_0001/container_1569259407961_0001_01_000400 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@davidsc-prod-m.c.unity-ads-ds-prd.internal:33949 --executor-id 368 --hostname davidsc-prod-w-0.c.unity-ads-ds-prd.internal --cores 8 --app-id application_1569259407961_0001 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/davidsc/appcache/application_1569259407961_0001/container_1569259407961_0001_01_000400/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/davidsc/appcache/application_1569259407961_0001/container_1569259407961_0001_01_000400/tensorflow-hadoop-1.6.0.jar > /var/log/hadoop-yarn/userlogs/application_1569259407961_0001/container_1569259407961_0001_01_000400/stdout 2> /var/log/hadoop-yarn/userlogs/application_1569259407961_0001/container_1569259407961_0001_01_000400/stderr
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

I followed the link in the second-to-last log line, but didn't find it very helpful.
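For reference, the runtime approach those earlier posts suggest is roughly the sketch below (assuming a standard SparkSession); it only changes the JVM root logger after the context is up, so it didn't get me all the way there:

# Sketch of the runtime approach: set the JVM root log level once the
# SparkContext exists. Startup messages are already printed by this point.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # also accepts WARN, INFO, DEBUG, ...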

I'm trying to reduce logging to just errors. Following the post I found, I am uploading a log4j.properties file with the following content:

# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=ERROR
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR

# Reduce verbosity for other spammy core classes.
log4j.logger.org.apache.spark=ERROR
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=ERROR
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=ERROR
log4j.logger.org.spark-project.jetty.server.handler.ContextHandler=ERROR

# Spark 2.0 specific Spam
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=ERROR

# from https://stackoverflow.com/questions/40608412/how-can-set-the-default-spark-logging-level
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=ERROR

I added that last entry about the Python API myself; maybe that is pulling in some conflicting logger? It seemed like a good thing to add.
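As an aside, the Python half of the gateway logs through Python's standard logging module rather than log4j, so if any noise were coming from that side it would have to be quieted separately from the properties file above. A minimal sketch:

# Sketch: quiet py4j's Python-side logging; this is independent of the
# JVM-side log4j.properties configuration.
import logging

logging.getLogger("py4j").setLevel(logging.ERROR)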

In any case, what I'm uploading is just a modification of what I've found here:

/usr/lib/spark/conf/log4j.properties

Which is a soft link to:

./etc/spark/conf.dist/log4j.properties

I do the copy in my Cloud Dataproc cluster initialization action. If I ssh to a worker and search for all the log4j.properties files, I find:

 find . 2>/dev/null | grep log4j.prop 2>/dev/null
./etc/pig/conf.dist/log4j.properties.template
./etc/pig/conf.dist/test-log4j.properties
./etc/hadoop/conf.empty/log4j.properties
./etc/zookeeper/conf.dist/rest/log4j.properties
./etc/zookeeper/conf.dist/log4j.properties
./etc/spark/conf.dist/log4j.properties.template
./etc/spark/conf.dist/log4j.properties
./ump-ltv-spark-log4j.properties.error

and I can see that ./etc/spark/conf.dist/log4j.properties is my new ERROR-only configuration.
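One way to double-check that the driver JVM actually picked up the new file is to read the root logger's level back through py4j. A debugging sketch (`_jvm` is a private attribute, so this is just for poking around):

# Sketch: inspect the effective log4j root level from PySpark via py4j.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
log4j = spark.sparkContext._jvm.org.apache.log4j
print(log4j.LogManager.getRootLogger().getLevel())  # expect ERROR if the file took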

I hope I don't have to modify the ./etc/hadoop/conf.empty/log4j.properties file, which has a lot more stuff in it.

Following another link, I can look at my Spark launch command:

~$ SPARK_PRINT_LAUNCH_COMMAND=1 spark-shell
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/jars/*:/etc/hadoop/conf/:/etc/hive/conf/:/usr/share/java/mysql.jar:/usr/local/share/google/dataproc/lib/* -Dscala.usejavacp=true -Dscala.usejavacp=true -Xmx106496m -Dflogger.backend_factory=com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell

That makes me think that GCP has wired in its own logging backend (note the -Dflogger.backend_factory=...Log4jBackendFactory option in the command); maybe that is where these messages are coming from.
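For completeness, the other approach I've seen, instead of overwriting the default file, is handing each executor JVM an explicit config through extraJavaOptions. A sketch with a placeholder path (the driver-side equivalent normally has to go on the spark-submit command line, since the driver JVM is already running by the time Python code executes):

# Sketch: point executor JVMs at an explicit log4j 1.x config file.
# The path is a placeholder and must exist on every worker.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.executor.extraJavaOptions",
        "-Dlog4j.configuration=file:/etc/spark/conf/log4j.properties",
    )
    .getOrCreate()
)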

It doesn't look like you posted the full error output; could you post the full content of the `prelaunch.err` and `stderr` files? – Igor Dvorzhak Oct 26 '19 at 21:40
