
Thanks for your time!

I'd like to create and print legible summaries of my (hefty) data while debugging my code, but stop creating and printing those summaries once I'm finished, to speed things up. I was advised to use logging, which I implemented. It works as expected for plain text messages; however, when printing summaries of dataframes it seems to ignore the log level, creating and printing them every time.

Is logging the right thing to be using here, or is there a better way to do this? I could comment out the lines of code or wrap them in if statements etc., but it's a large codebase and I know that I'll need to make the same checks in the future as further elements are added - this seems like exactly what logging should be for.

from pyspark.sql.functions import col, count
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

df = spark.createDataFrame([(1,2),(3,4)],["COLA","COLB"])

print "1"
logger.setLevel(logging.DEBUG)
logger.debug("1 - DEBUG - Print the message and show the table")
logger.debug(df.show())

print "2"
logger.setLevel(logging.INFO)
logger.debug("2 - INFO - Don't print the message or show the table")
logger.debug(df.show())

print "3"
logger.setLevel(logging.INFO)
logger.debug("3 - INFO - Don't print the message or show the collected data")
logger.debug(df.collect())

print "4"
logger.setLevel(logging.DEBUG)
logger.debug("4 - DEBUG - Print the message and the collected data")
logger.debug(df.collect())

Output:

1
DEBUG:__main__:1 - DEBUG - Print the message and show the table
+----+----+
|COLA|COLB|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
DEBUG:__main__:None
2
+----+----+
|COLA|COLB|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
3
4
DEBUG:__main__:4 - DEBUG - Print the message and the collected data
DEBUG:__main__:[Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]

1 Answer


Logging is working as expected here: df.show() and df.collect() are Spark actions, so Spark executes them even though they appear inside logger.debug(). Python evaluates the argument before the logger ever checks its level, and df.show() prints the table itself and returns None - which is why the table appears regardless of the log level, and why DEBUG:__main__:None is logged when DEBUG is enabled.

  • If we set the log level to DEBUG, we can also see INFO-level logs.

  • If we set the log level to INFO, we can't see DEBUG-level logs.

One workaround is to store the collect()/take(n) result in a variable first, then use that variable in your logging.

from pyspark.sql.functions import col, count
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

df = spark.createDataFrame([(1,2),(3,4)],["COLA","COLB"])

# store the result; avoid collect() on a huge dataset - use .take(n) instead
res = df.collect()

# get 10 records from df
res = df.take(10)

print "1"
#1
logger.setLevel(logging.DEBUG)
logger.debug("1 - DEBUG - Print the message and show the table")
#DEBUG:__main__:1 - DEBUG - Print the message and show the table
logger.debug(res)
#DEBUG:__main__:[Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]

print "2"
#2
logger.setLevel(logging.INFO)
logger.debug("2 - INFO - Don't print the message or show the table")
logger.debug(res)  # this won't print, as the log level is INFO
logger.info("result: %s", res)  # this will get printed
#INFO:__main__:result: [Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]

Prefer .take(n) over .collect() on large datasets.
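To skip the Spark action entirely when debugging is off (so even .take(n) never runs), you can guard it with logger.isEnabledFor(); a minimal sketch, reusing the df and logger from above:

import logging

# run the action only when DEBUG is actually enabled, so no time is
# spent collecting data once the level is raised back to INFO
if logger.isEnabledFor(logging.DEBUG):
    res = df.take(10)                # the action executes only inside this guard
    logger.debug("result: %s", res)  # %s defers str(res) until the record is emitted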

  • Thanks Shu, really helpful - my expectations of logging were clearly too high :-) Confirming my understanding: 1. the print/not-print issue is solved; 2. the execute/not-execute issue is solved - logger.debug(res) would only execute the df.collect() at the appropriate logging level. One follow-up: is it possible to print the result (e.g. INFO:__main__:result: [Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]) in a tabular format? Something easier to read as you QA your code? My closest solution (it's terrible) thus far would be to print each row in a for loop, but... that can't be it. – Amit May 14 '20 at 12:43
  • @Amit, I can't think of any other way to print in tabular format other than `.collect()` - even a for loop prints Row objects, not a `dataframe`. But the Row objects are themselves fairly readable, since they show `key=value` pairs (one way to format them as a table is sketched after this thread). – notNull May 15 '20 at 02:21
  • 1
  • Shu - Thanks for your help, mate. Much appreciated. In the end I went with just creating a class of qaFunctions with an "if self.level <=" check at the start of each method (sketched below). In a way it feels like a cleaner solution, since I'll be able to record every single check I ever did and switch them on/off independently when I need to debug again in the future. – Amit May 15 '20 at 14:20
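On the tabular-format question in the comments above: collected Row objects are iterable and expose their column names via __fields__, so one option is to render them as a text table in plain Python. A rough sketch (rows_to_table is a hypothetical helper, not a pyspark API):

def rows_to_table(rows):
    """Hypothetical helper (not part of pyspark): render a list of
    pyspark Row objects as a plain-text, show()-style table."""
    if not rows:
        return "(no rows)"
    headers = list(rows[0].__fields__)               # column names, in order
    data = [[str(v) for v in row] for row in rows]   # a Row iterates like a tuple
    widths = [max(len(h), max(len(r[i]) for r in data))
              for i, h in enumerate(headers)]
    sep = "+" + "+".join("-" * w for w in widths) + "+"
    def line(vals):
        return "|" + "|".join(v.rjust(w) for v, w in zip(vals, widths)) + "|"
    return "\n".join([sep, line(headers), sep] + [line(r) for r in data] + [sep])

res = df.take(10)
logger.info("\n%s", rows_to_table(res))
# INFO:__main__:
# +----+----+
# |COLA|COLB|
# +----+----+
# |   1|   2|
# |   3|   4|
# +----+----+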
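And a minimal sketch of the qaFunctions approach from the last comment (class and method names are assumptions, not code from the thread): each check guards itself with "if self.level <=", so expensive actions are skipped entirely when the level is raised.

import logging

logger = logging.getLogger(__name__)

class QaFunctions:
    """Hypothetical sketch of the pattern described above: every QA check
    is an independent method that can be switched on/off via self.level."""
    def __init__(self, level=logging.INFO):
        self.level = level

    def check_row_sample(self, df, n=10):
        if self.level <= logging.DEBUG:  # skipped entirely when level is INFO
            logger.debug("sample rows: %s", df.take(n))

    def check_row_count(self, df):
        if self.level <= logging.INFO:
            logger.info("row count: %d", df.count())

# qa = QaFunctions(level=logging.DEBUG)  # switch the DEBUG-level checks on
# qa.check_row_sample(df)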