
I am working on a Dockerized Python/Django project that includes a container for Celery workers, into which I have been integrating the off-the-shelf Airflow Docker containers.

I have Airflow successfully running Celery tasks in the pre-existing container by instantiating a Celery app with the Redis broker and result backend specified and making a remote call via send_task; however, none of the logging carried out by the Celery task makes it back to the Airflow logs.
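The remote call is made roughly like this (the app name, task name and Redis URLs here are illustrative, not the real ones):

from celery import Celery

# Broker and result backend point at the shared Redis instance (URLs are placeholders)
app = Celery("django_project",
             broker="redis://redis:6379/0",
             backend="redis://redis:6379/0")

# Invoke the task registered on the Django-side worker by name and wait for it
result = app.send_task("myapp.tasks.do_work", args=[42])
result.get(timeout=300)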

Initially, as a proof of concept (I am completely new to Airflow), I had set it up to run the same code by exposing it to the Airflow containers and creating Airflow tasks to run it on the Airflow Celery worker container. This did result in all the logging being captured, but it's definitely not the way we want it architected, as it makes the Airflow containers very fat due to the repetition of dependencies from the Django project.

The documentation says "Most task handlers send logs upon completion of a task", but I wasn't able to find more detail that might give me a clue about how to enable the same behaviour in my situation.

Is there any way to get these logs back to Airflow when running the Celery tasks remotely?

mshedden
    The answer to your question is a combination of [this](https://stackoverflow.com/a/68198920/4137497) answer and [this](https://stackoverflow.com/a/73080610/4137497) one. – tsveti_iko Jul 27 '22 at 12:42

1 Answer


Instead of "returning the logs to Airflow", an easy-to-implement alternative (because Airflow natively supports it) is to activate remote logging. This way, all logs from all workers would end up e.g. on S3, and the webserver would automatically fetch them.

The following illustrates how to configure remote logging using an S3 backend. Other options (e.g. Google Cloud Storage, Elastic) can be implemented similarly.

  1. Set remote_logging to True in airflow.cfg
  2. Build an Airflow connection URI. This example from the official docs is particularly useful IMO. One should end up with something like:
aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F

        It is also possible to create the connection through the webserver GUI, if needed.

  3. Make the connection URI available to Airflow. One way of doing so is to make sure that the environment variable AIRFLOW_CONN_{YOUR_CONNECTION_NAME} is set. Example for connection name REMOTE_LOGS_S3:
export AIRFLOW_CONN_REMOTE_LOGS_S3=aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
  4. Set remote_log_conn_id to the connection name (e.g. REMOTE_LOGS_S3) in airflow.cfg
  5. Set remote_base_log_folder in airflow.cfg to the desired bucket/prefix (see the combined sketch after this list). Example:
remote_base_log_folder = s3://my_bucket_name/my/prefix
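Putting the airflow.cfg entries from steps 1, 4 and 5 together, a minimal sketch could look like the following. The [logging] section name assumes Airflow 2.x (older releases kept these keys under [core]); the connection name and bucket/prefix are just the placeholders used above.

# airflow.cfg (sketch; section name assumes Airflow 2.x, values are placeholders)
[logging]
remote_logging = True
remote_log_conn_id = REMOTE_LOGS_S3
remote_base_log_folder = s3://my_bucket_name/my/prefix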

This related SO question goes deeper into remote logging.

If debugging is needed, looking at the worker logs locally (i.e., inside the worker container) should help.
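For example, in a Dockerized setup like the one described in the question, tailing the Celery worker container's output is often enough (the container name here is illustrative):

docker logs -f celery_worker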

swimmer