
I've seen several bits on emitting logs from PythonOperator, and for configuring Airflow logs, but haven't found anything that will let me emit logs from within a containerized process, e.g. the DataProcPySparkOperator.

I've gone so far as to include the following at the top of the pyspark script that runs inside the Operator's cluster:

import logging
logging.info('Test bare logger')
for ls in ['airflow', 'airflow.task', __name__]:
    l = logging.getLogger(ls)
    l.info('Test {} logger'.format(ls))
print('Test print() logging')

It produces no output, although the Operator script otherwise runs as intended.

I assume that I could build a connection to cloud storage (or the DB) from within the cluster, perhaps piggybacking off the existing connection used to read & write files, but ... that seems like a lot of work for a common need. I would very much like to get occasionally-referenced status checks about the number of records or other data at intermediate stages of the computation.
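To be concrete, something like the following is the kind of status record I have in mind (the helper name and fields are hypothetical; the upload itself would piggyback on whatever storage client is already available in the cluster):

```python
import json

def format_status(stage, record_count):
    """Serialize an intermediate-stage status record (hypothetical helper)."""
    return json.dumps({"stage": stage, "records": record_count}, sort_keys=True)

# Inside the pyspark script, the blob could then be written alongside the
# job's other output, e.g. (assuming google.cloud.storage and a writable
# bucket):
#   bucket.blob(f"status/{stage}.json").upload_from_string(
#       format_status(stage, n))
```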

Does Airflow set up a Python logger in the cluster by default? If so, how do I access it?

Sarah Messer
    Is this script running on GCP or local cluster and where do you want to emit the logs to? One possible way to export general python logging to GCP logging explorer is - https://cloud.google.com/logging/docs/setup/python – Henry Gong Dec 09 '20 at 00:45
  • Script is running on GCP. Thanks for the link. Will look it over. I'd like the logs to be in with Airflow's normal task-execution logs, if possible. – Sarah Messer Dec 09 '20 at 14:54

1 Answer


If you run Airflow on Cloud Composer, note that Cloud Composer surfaces only Airflow logs and streaming logs.

When you run a PySpark job on a Dataproc cluster, the job driver output is stored in Cloud Storage (see Accessing job driver output).

You can also enable Dataproc to save job driver logs in Cloud Logging.

As described in the Dataproc documentation, to enable job driver logs in Logging, set the following cluster property when creating the cluster:

dataproc:dataproc.logging.stackdriver.job.driver.enable=true

The following cluster properties are also required, and are set by default when a cluster is created:

dataproc:dataproc.logging.stackdriver.enable=true
dataproc:jobs.file-backed-output.enable=true
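If you create the cluster from Airflow, these properties can be collected in a dict and passed to the cluster-create operator. This is a sketch assuming Airflow 1.10's contrib `DataprocClusterCreateOperator`, which accepts a `properties` argument; newer provider packages use a `cluster_config` instead, so adjust for your version.

```python
# Cluster properties mirroring the flags above; the second and third
# are set by default when a cluster is created.
driver_logging_properties = {
    "dataproc:dataproc.logging.stackdriver.job.driver.enable": "true",
    "dataproc:dataproc.logging.stackdriver.enable": "true",
    "dataproc:jobs.file-backed-output.enable": "true",
}

# Usage (cluster/project names are hypothetical):
# create_cluster = DataprocClusterCreateOperator(
#     task_id="create_cluster",
#     cluster_name="my-cluster",
#     project_id="my-project",
#     properties=driver_logging_properties,
# )
```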
  • This feels like a partial solution to me: I don't have Python logging set up inside the cluster, but was hoping that Airflow would also configure that. I was hoping there was some parameter I could pass to Airflow that would tell it to set up Python logging, but that might not be an implemented feature... – Sarah Messer Dec 16 '20 at 20:10
  • I'm now getting log messages (in GCP AI Platform logs, but not in Airflow logs), with _both_ print() and with `logger=logging.getLogger('projects/MyProjectName/logs/master-replica-0')` I don't think I changed anything, but I'm also surprised I didn't see my test messages before... *sigh* Going to accept this answer... – Sarah Messer Jan 04 '21 at 22:46