34

I want my Spark driver program, written in Python, to output some basic logging information. There are three ways I can see to do this:

  1. Using the PySpark py4j bridge to get access to the Java log4j logging facility used by Spark.

log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")

  2. Just use standard console print.

  3. The logging module from the Python standard library. This seems ideal and the most Pythonic approach; however, at least out of the box, it doesn't work and logged messages don't seem to be recoverable. Of course, it can be configured to log to py4j->log4j and/or to the console (a sketch of that configuration follows this list).
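
For concreteness, here is a minimal sketch of option 3 wired up to option 1: a custom `logging.Handler` that forwards each record to Spark's log4j through the py4j bridge, alongside a plain console handler. The handler class and logger names are illustrative, and it assumes an already-created SparkContext `sc`.

import logging

class Log4jForwardingHandler(logging.Handler):
    """Illustrative handler: forward Python logging records to Spark's log4j via py4j."""

    def __init__(self, spark_context, logger_name="pyspark_driver"):
        super(Log4jForwardingHandler, self).__init__()
        log4j = spark_context._jvm.org.apache.log4j
        self._j_logger = log4j.LogManager.getLogger(logger_name)

    def emit(self, record):
        msg = self.format(record)
        if record.levelno >= logging.ERROR:
            self._j_logger.error(msg)
        elif record.levelno >= logging.WARNING:
            self._j_logger.warn(msg)
        else:
            self._j_logger.info(msg)

# Wiring it up in the driver (assumes an existing SparkContext `sc`):
log = logging.getLogger(__name__)
log.setLevel(logging.INFO)
log.addHandler(Log4jForwardingHandler(sc))   # forwards to log4j (option 1)
log.addHandler(logging.StreamHandler())      # plain console output (option 2)
log.info("driver logging initialized")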

So, the official programming guide (https://spark.apache.org/docs/1.6.1/programming-guide.html) doesn't mention logging at all. That's disappointing. There should be a standard, documented, recommended way to log from a Spark driver program.

I searched for this issue and found this: How do I log from my Python Spark script

But the contents of that thread were unsatisfactory.

Specifically, I have the following questions:

  • Am I missing a standard way to log from a PySpark driver program?
  • Are there any pros/cons to logging to py4j->log4j vs console?

  • Did you ever get an answer to this? It seems to me 1 is the only solution. 2 didn't work for me (I was surprised by this). Looking at the other question you linked to, I don't think `logging.getLogger('py4j')` works, since py4j doesn't use the log4j logger, it uses java.util.logging. – dragonx Jun 22 '16 at 00:58
  • The conclusion I reached is that Spark basically doesn't intend for custom Spark jobs to do their own logging. Usually, you write a driver program that encodes a fairly simple workflow, and Spark itself does the heavy lifting and provides various monitoring and diagnostic tools to use. – clay Jun 22 '16 at 01:08
  • Are you submitting your `.py` to a cluster or on a standalone machine? Do you want to log events from each worker or only on your client? – Jocer Nov 15 '16 at 13:33

3 Answers

6

A cleaner solution is to use the standard Python logging module with a custom distributed handler that collects log messages from all nodes of the Spark cluster.

See "Logging in PySpark" in this Gist.

  • Forgot to mention: this is the example of how it's done https://github.com/zillow/sqs-log4j-handler – user1944010 Aug 30 '19 at 15:27
  • And, of course, you can use the same technique to forward messages to log4j under Spark: https://gist.github.com/thsutton/65f0ec3cf132495ef91dc22b9bc38aec –  Jan 11 '21 at 00:48
0

There doesn't seem to be a standard way to log from a PySpark driver program, but using the log4j facility through the PySpark py4j bridge is the recommended approach. Logging to the console is simple, but log4j provides more advanced logging features and is what Spark itself uses. You can also set things up to log to both, which can be helpful for debugging.
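
A minimal sketch of that recommendation, assuming an existing SparkContext named `sc` (the logger name is illustrative):

# Get a handle on Spark's log4j through the py4j gateway.
log4j = sc._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("my_driver")

logger.info("goes through Spark's log4j configuration (appenders, levels, log files)")
print("goes straight to the console where the driver runs")

# One advantage of the log4j route: it honors Spark's log level settings, so
# after the following call, subsequent logger.info(...) calls are filtered out.
sc.setLogLevel("WARN")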

-2

In my Python dev environment (single-machine Spark setup) I use this:

import logging


def do_my_logging(log_msg):
    logger = logging.getLogger(__FILE__)
    logger.warning('log_msg = {}'.format(log_msg))


do_my_logging('Some log message')

which works using the spark-submit script.
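
If the message doesn't show up, or Python 2 complains "No handlers could be found for logger ...", a handler has to be attached first. A minimal sketch, to be run once in the driver:

import logging

# Attach a stream handler and set a format/level; without this, the standard
# library may have nowhere to send the record (depending on Python version).
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("visible on the driver's console when run via spark-submit")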

  • If you -1 this, it would be good to add in the comments the reason why ... if this is an inappropriate answer, then I will delete it. – Ytsen de Boer Feb 24 '18 at 13:31
  • You mean on multiple machines on a cluster? OP asks about logging from the driver program. Don't use this setup for logging on the cluster. – Ytsen de Boer Mar 29 '18 at 10:53
  • I get: No handlers could be found for logger `__FILE__` – hoaphumanoid Apr 25 '18 at 11:22
  • You should remove the apostrophes. Now it will be looking for a logger with that exact string as its name. Maybe also try replacing "FILE" with (lowercase) "name" (not sure if that matters, though). – Ytsen de Boer Apr 25 '18 at 14:42
  • @YtsendeBoer can you explain what you mean by "do not use this setup for logging on the cluster"? The question was in the context of Spark. If you want to log a dataframe being dumped to Hive, would this be a good solution, or not because of the parallelism? – monkey intern Sep 05 '19 at 11:03
  • Yes, this only works if everything is done on the same machine. Useful for debugging in the development stage. – Ytsen de Boer Sep 05 '19 at 11:09