
After specifying a configuration file in a spark-submit as in this answer:

spark-submit \
    --master local \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
    --py-files ./dist/src-1.0-py3-none-any.whl \
    --files "/job/log4j.properties" \
    main.py -input $1 -output $2 -mapper $3 $4

where /job/log4j.properties is the path inside the Docker container and the trailing arguments are the application's.

With the dockerized application structure being:

job/
|--  entrypoint.sh
|--  log4j.properties
|--  main.py

I'm getting the following error:

log4j:ERROR Ignoring configuration file [file:/log4j.properties].
log4j:ERROR Could not read configuration file from URL [file:/log4j.properties].
java.io.FileNotFoundException: /log4j.properties (No such file or directory)

It works fine if I set the configuration from the Spark context instead, using PropertyConfigurator.configure:

log4j = sc._jvm.org.apache.log4j
log4j.PropertyConfigurator.configure("/job/log4j.properties")
logger = log4j.Logger.getLogger("MyLogger")

That is, all Spark INFO-level logging is silenced, and I only see warnings and errors, which is what I've set in the configuration file. However, if I just instantiate a logger as (desirable behaviour):

log4jLogger = sc._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger("MyLogger")

It doesn't behave as it does when set via PropertyConfigurator.configure, which I've configured to silence all Spark INFO-level logging. Any idea how to use the logging configuration passed in the spark-submit to control the application's logs?

Using PySpark with Spark 3.0.1 and Python 3.8.0.
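For reference, the question doesn't show the configuration file itself, but a minimal log4j 1.x properties file matching the described behaviour (Spark's INFO chatter silenced, the application's logger kept at INFO) might look like:

    # Root logger at WARN: silences Spark's INFO-level output
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

    # Application logger kept at INFO
    log4j.logger.MyLogger=INFO

The logger name ("MyLogger") must match the string passed to getLogger in the application.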

yatu
    Try removing `/job/` from the extraJavaOptions. The file isn't mounted in a job folder unless you pass `--files "/job/log4j.properties:/job/log4j.properties"` – OneCricketeer May 13 '21 at 14:51
  • Thanks @OneCricketeer. Still seems to behave in the same way though – yatu May 13 '21 at 15:05
  • Have you tried making it `file:///job/log4j.properties`? – pltc May 18 '21 at 05:15
  • Also, can you show an example of your `log4j.properties`, and how do you know `It doesn't seem to pick up the configuration`? – pltc May 18 '21 at 15:31
  • The configuration file should work fine. I've tested it directly from the spark context in the application using the method `PropertyConfigurator.configure`, where I've set all spark loggings to WARN or ERROR level, and `MyLogger` to INFO level, which silenced all other loggings. I don't get the same behaviour setting it through the spark-submit @pltc – yatu May 19 '21 at 06:41
  • I've tried to set both configuration paths to `--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///log4j.properties"` and still the same. I understand that as @OneCricketeer mentioned the `/job/` is unnecessary – yatu May 19 '21 at 06:53
  • `file:///foo.bar` isn't correct either (for a clustered environment) unless you distributed the same file to every executor and `foo.bar` is at the root. You could try `file:///path/to/foo.bar`, assuming it is a static pre-distributed file, but don't think that's what's wanted here. From my understanding, `files` doesn't work with a local master because there is no data to distribute and no temporary worker directory is allocated. Therefore, you're going to need the full path here – OneCricketeer May 19 '21 at 12:20
  • That did it @OneCricketeer ! Just got it to work specifying the paths as `file:///job/log4j.properties`, and `--files "/job/log4j.properties"`. Feel free to write the comment as an answer and I'll accept – yatu May 19 '21 at 14:46

1 Answer


Since you're in a container and using --master local, you're limited to the local filesystem, which you can access via a file:// URI.

--files takes paths relative to where you run the command and, I think, adds the files to the driver/executor classpath.

Putting those two pieces of information together, you can specify

-Dlog4j.configuration=file:///job/log4j.properties

Along with

--files "/job/log4j.properties"

If you were to run this in a clustered environment, however, the -Dlog4j.configuration setting would not be correct, since the file wouldn't exist at that local path on every node.
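Putting the two pieces together with the question's original command, the submission would look like this (only the log4j URI changes; the OP confirmed in the comments that this combination works):

    spark-submit \
        --master local \
        --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///job/log4j.properties" \
        --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///job/log4j.properties" \
        --py-files ./dist/src-1.0-py3-none-any.whl \
        --files "/job/log4j.properties" \
        main.py -input $1 -output $2 -mapper $3 $4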

OneCricketeer