One gotcha is that print() statements will not make it to the log file, so you need to use Spark's log4j logging functionality. In PySpark, I created a utility function that sends output to the log file, but also prints it to the notebook when the notebook is run manually:
# utility methods for logging
log4jLogger = sc._jvm.org.apache.log4j

# give a meaningful name to your logger (mine is CloudantRecommender)
LOGGER = log4jLogger.LogManager.getLogger("CloudantRecommender")

def info(*args):
    print(args)         # sends output to notebook
    LOGGER.info(args)   # sends output to kernel log file

def error(*args):
    print(args)         # sends output to notebook
    LOGGER.error(args)  # sends output to kernel log file
I use the function like this in my notebook:
info("some log output")
If I check the log files, I can see my log output is getting written:
! grep 'CloudantRecommender' $HOME/logs/notebook/*pyspark*
kernel-pyspark-20170105_164844.log:17/01/05 10:49:08 INFO CloudantRecommender: [Starting load from Cloudant: , 2017-01-05 10:49:08]
kernel-pyspark-20170105_164844.log:17/01/05 10:53:21 INFO CloudantRecommender: [Finished load from Cloudant: , 2017-01-05 10:53:21]
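Both the log lines above and the error handling below append a timestamp via a ts() helper. Its implementation isn't shown in this section; a minimal sketch, assuming it simply formats the current time, could be:

from datetime import datetime

def ts():
    # return the current time as a human-readable string, e.g. '2017-01-05 10:49:08'
    return datetime.now().strftime('%Y-%m-%d %H:%M:%S')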
Exceptions don't appear to get sent to the log file either, so you will need to wrap your code in a try/except block and log the error, e.g.
import traceback

try:
    # your spark code that may throw an exception
    ...
except Exception as e:
    # send the exception to the spark logger
    # (ts() is the timestamp helper sketched above)
    error(str(e), traceback.format_exc(), ts())
    raise e
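Putting it together, the load step that produced the log lines above might be wrapped something like this (load_from_cloudant() is a placeholder for whatever Spark code actually reads the data, not the real function name from the notebook):

try:
    info("Starting load from Cloudant: ", ts())
    df = load_from_cloudant()   # placeholder for the actual Cloudant read code
    info("Finished load from Cloudant: ", ts())
except Exception as e:
    error(str(e), traceback.format_exc(), ts())
    raise e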
CAUTION: Another gotcha that hit me during debugging is that scheduled jobs run a specific version of a notebook, so make sure you update the scheduled job when you save a new version of your notebook.