I'm running Airflow 1.9 on Kubernetes in AWS. I would like the task logs to go to S3, since the Airflow containers themselves are not long-lived.
I've read the various threads and documents describing the process, but I still cannot get it working. First, a test that demonstrates to me that the S3 configuration and permissions are valid. This is run on one of our worker instances.
Use Airflow to write to an S3 file:
airflow@airflow-worker-847c66d478-lbcn2:~$ id
uid=1000(airflow) gid=1000(airflow) groups=1000(airflow)
airflow@airflow-worker-847c66d478-lbcn2:~$ env |grep s3
AIRFLOW__CONN__S3_LOGS=s3://vevo-dev-us-east-1-services-airflow/logs/
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_logs
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://vevo-dev-us-east-1-services-airflow/logs/
airflow@airflow-worker-847c66d478-lbcn2:~$ python
Python 3.6.4 (default, Dec 21 2017, 01:37:56)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import airflow
>>> s3 = airflow.hooks.S3Hook('s3_logs')
/usr/local/lib/python3.6/site-packages/airflow/utils/helpers.py:351: DeprecationWarning: Importing S3Hook directly from <module 'airflow.hooks' from '/usr/local/lib/python3.6/site-packages/airflow/hooks/__init__.py'> has been deprecated. Please import from '<module 'airflow.hooks' from '/usr/local/lib/python3.6/site-packages/airflow/hooks/__init__.py'>.[operator_module]' instead. Support for direct imports will be dropped entirely in Airflow 2.0.
DeprecationWarning)
>>> s3.load_string('put this in s3 file', airflow.conf.get('core', 'remote_base_log_folder') + "/airflow-test")
[2018-02-23 18:43:58,437] {{base_hook.py:80}} INFO - Using connection to: vevo-dev-us-east-1-services-airflow
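For the record, here is the same write test as a standalone snippet using the import path the deprecation warning above asks for (a minimal sketch; it assumes the same s3_logs connection and env vars as in the session above):

# Sketch of the write test above, using the non-deprecated import path
# (importing S3Hook directly from airflow.hooks is what triggers the warning).
from airflow import configuration as conf
from airflow.hooks.S3_hook import S3Hook

hook = S3Hook('s3_logs')  # the remote_log_conn_id from the env vars above
hook.load_string('put this in s3 file',
                 conf.get('core', 'remote_base_log_folder') + '/airflow-test')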
Now let's retrieve the file from S3 and look at the contents. Everything looks good here.
root@4f8171d4fe47:/# aws s3 cp s3://vevo-dev-us-east-1-services-airflow/logs//airflow-test .
download: s3://vevo-dev-us-east-1-services-airflow/logs//airflow-test to ./airflow-test
root@4f8171d4fe47:/# cat airflow-test
put this in s3 fileroot@4f8171d4fe47:/stringer#
So the Airflow S3 connection appears to be good, yet Airflow jobs do not use S3 for logging. Here are the settings I have; I figure something is either wrong or missing.
Env vars of the running worker/scheduler/master instances:
airflow@airflow-worker-847c66d478-lbcn2:~$ env |grep -i s3
AIRFLOW__CONN__S3_LOGS=s3://vevo-dev-us-east-1-services-airflow/logs/
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_logs
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://vevo-dev-us-east-1-services-airflow/logs/
S3_BUCKET=vevo-dev-us-east-1-services-airflow
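Just to confirm that those AIRFLOW__CORE__* variables are actually visible to the config layer, a quick check (a sketch, run inside the same worker container) should echo them back:

# Sketch: AIRFLOW__<SECTION>__<KEY> env vars should show up through airflow.conf.
# Expected values are the ones from the env listing above.
import airflow

print(airflow.conf.get('core', 'remote_log_conn_id'))      # expect 's3_logs'
print(airflow.conf.get('core', 'remote_base_log_folder'))  # expect the s3:// URL above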
This shows that the s3_logs connection exists in Airflow:
airflow@airflow-worker-847c66d478-lbcn2:~$ airflow connections -l|grep s3
│ 's3_logs' │ 's3' │ 'vevo-dev-
us-...vices-airflow' │ None │ False │ False │ None │
I put this file https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/airflow_local_settings.py in place in my Docker image. You can see an example here on one of our workers:
airflow@airflow-worker-847c66d478-lbcn2:~$ ls -al /usr/local/airflow/config/
total 32
drwxr-xr-x. 2 root root 4096 Feb 23 00:39 .
drwxr-xr-x. 1 airflow airflow 4096 Feb 23 00:53 ..
-rw-r--r--. 1 root root 4471 Feb 23 00:25 airflow_local_settings.py
-rw-r--r--. 1 root root 0 Feb 16 21:35 __init__.py
We have edited the file to define the REMOTE_BASE_LOG_FOLDER variable. Here is the diff between our version and the upstream version:
index 899e815..897d2fd 100644
--- a/var/tmp/file
+++ b/config/airflow_local_settings.py
@@ -35,7 +35,8 @@ PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'
# Storage bucket url for remote logging
# s3 buckets should start with "s3://"
# gcs buckets should start with "gs://"
-REMOTE_BASE_LOG_FOLDER = ''
+REMOTE_BASE_LOG_FOLDER = conf.get('core', 'remote_base_log_folder')
+
DEFAULT_LOGGING_CONFIG = {
'version': 1,
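As an extra sanity check on the edited copy itself (a sketch; it assumes /usr/local/airflow/config is on the PYTHONPATH so the module can be imported by name), the module-level values it resolved can be inspected directly:

# Sketch: import the edited airflow_local_settings.py and check what it resolved.
from airflow_local_settings import REMOTE_BASE_LOG_FOLDER, DEFAULT_LOGGING_CONFIG

print(REMOTE_BASE_LOG_FOLDER)               # expect the s3:// URL from airflow.cfg / the env
print(DEFAULT_LOGGING_CONFIG['handlers'])   # if the s3 branch fired, the task handler
                                            # entry should reference the S3 task handler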
Here you can see that the setting is correct on one of our workers.
>>> import airflow
>>> airflow.conf.get('core', 'remote_base_log_folder')
's3://vevo-dev-us-east-1-services-airflow/logs/'
Given that REMOTE_BASE_LOG_FOLDER starts with 's3://' and REMOTE_LOGGING is True
>>> airflow.conf.get('core', 'remote_logging')
'True'
I would expect this block https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/airflow_local_settings.py#L122-L123 to evaluate to true and send the logs to S3.
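For reference, the block I am linking to is roughly the following (paraphrased from the template; the exact line numbers on master may have shifted):

# Paraphrase of the s3 branch in airflow_local_settings.py (the lines linked above):
# with REMOTE_LOGGING true and an s3:// base folder, the S3 handlers should be attached.
if REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('s3://'):
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['s3'])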
Can anyone who has S3 logging working on 1.9 point out what I am missing? I would like to submit a PR to the upstream project to update the docs, as this seems to be a pretty common problem, and as near as I can tell the upstream docs are either not valid or frequently get misinterpreted.
Thanks! G.