I am using Python to implement Spark jobs. We wanted to get the Python logging output from the application into the Spark history server, so we used the method outlined here:
PySpark logging from the executor
However, the problem is that since the yarn_logger initialization only happens in the driver, the executors still run with a Python logging level of WARNING, which means no logs from the executors show up.
In my driver I do the following:
import yarn_logger

if __name__ == '__main__':
    # initialize logging in main
    yarn_logger.YarnLogger.setup_logger()
And in the other Python files, I just initialize the Python logging module:
import logging
LOG = logging.getLogger(__name__)
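For instance, a typical executor-side function in one of those modules looks roughly like this (the function name and the `rdd.map` call are made up for illustration; it uses the module-level `LOG` from above):

def parse_record(line):
    # Runs on an executor; since the executor's Python process is still at
    # the default WARNING level, this INFO message never shows up.
    LOG.info("parsing record: %s", line)
    return line.split(',')

# called from the driver, e.g.: rdd.map(parse_record).collect()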
But this results in only the logs emitted in the driver context showing up.
How do I architect this so that yarn_logger is initialized only once per process, whether the application is running in local mode or cluster mode? I could, of course, initialize yarn_logger in each Python module of my application, but that might cause it to be initialized multiple times per process if I run the application in local mode.
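The only workaround I can think of is to wrap the setup in an idempotent helper and call it from every module or function that might run on an executor, roughly like this (a sketch only; `ensure_yarn_logging` and the module-level flag are names I made up):

import yarn_logger

_YARN_LOGGING_DONE = False  # hypothetical module-level flag

def ensure_yarn_logging():
    # Safe to call from any module, driver or executor: the setup
    # runs at most once per Python process.
    global _YARN_LOGGING_DONE
    if not _YARN_LOGGING_DONE:
        yarn_logger.YarnLogger.setup_logger()
        _YARN_LOGGING_DONE = True

That would guard against double initialization in local mode, but having to call it everywhere feels clunky, which is why I'm asking whether there is a cleaner pattern.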