I want my Spark driver program, written in Python, to output some basic logging information. There are three ways I can see to do this:
- Using the PySpark py4j bridge to get access to the Java log4j logging facility used by Spark.
# Reach the JVM-side log4j API through the py4j gateway (note: sc._jvm is a private attribute)
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
- Just using standard console print.
- Using the logging module from the Python standard library. This seems ideal and the most Pythonic approach; however, at least out of the box, it doesn't work, and logged messages don't seem to be recoverable. Of course, it can be configured to log to py4j->log4j and/or to the console (a rough sketch of both is below).
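To make that last point concrete, here is a minimal sketch of what I mean by "can be configured", assuming an active SparkContext named sc. The Log4jHandler class, its name, and the level-mapping logic are my own illustration and not anything PySpark provides.

import logging
import sys

# Option A (console): a minimal configuration so driver-side log records
# simply go to stderr, alongside Spark's own output.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Option B (py4j -> log4j): a hypothetical handler that forwards Python log
# records to Spark's log4j logger; assumes an active SparkContext `sc`.
class Log4jHandler(logging.Handler):
    def __init__(self, spark_context, name="pyspark-driver"):
        super(Log4jHandler, self).__init__()
        log4j = spark_context._jvm.org.apache.log4j
        self._logger = log4j.LogManager.getLogger(name)

    def emit(self, record):
        msg = self.format(record)
        if record.levelno >= logging.ERROR:
            self._logger.error(msg)
        elif record.levelno >= logging.WARNING:
            self._logger.warn(msg)
        else:
            self._logger.info(msg)

# logging.getLogger().addHandler(Log4jHandler(sc))
log = logging.getLogger(__name__)
log.info("pyspark driver logger initialized")

Neither of these feels like an official, sanctioned approach, which brings me to my actual complaint and questions.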
So, the official programming guide (https://spark.apache.org/docs/1.6.1/programming-guide.html) doesn't mention logging at all. That's disappointing. There should be a standard, documented, recommended way to log from a Spark driver program.
I searched for this issue and found this: How do I log from my Python Spark script
But the answers in that thread were unsatisfactory.
Specifically, I have the following questions:
- Am I missing a standard way to log from a PySpark driver program?
- Are there any pros/cons to logging to py4j->log4j vs console?