
We started to implement Airflow for task scheduling about a year ago, and we are slowly migrating more and more tasks to it. At some point we noticed that the server was filling up with logs, even after we implemented remote logging to S3. I'm trying to understand the best way to handle logs, and I've found a lot of conflicting advice, such as in this Stack Overflow question from four years ago. The suggestions include:

  • Implementing maintenance DAGs to clean out logs (airflow-maintenance-dags); a sketch of what we tried follows this list
  • Implementing our own FileTaskHandler
  • Using the logrotate Linux utility
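
For reference, the maintenance-DAG approach (as we understand it) boils down to something like the sketch below. This is a minimal illustration assuming Airflow 2.4+, not the airflow-maintenance-dags code itself; the log path, DAG id, start date, and 30-day retention are all placeholders for our setup.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    BASE_LOG_FOLDER = "/opt/airflow/logs"  # placeholder; match [logging] base_log_folder

    with DAG(
        dag_id="log_cleanup",
        start_date=datetime(2023, 1, 1),
        schedule="@daily",  # "schedule_interval" on Airflow < 2.4
        catchup=False,
    ):
        # Delete log files older than 30 days, then prune empty directories.
        BashOperator(
            task_id="delete_old_logs",
            bash_command=(
                f"find {BASE_LOG_FOLDER} -type f -name '*.log' -mtime +30 -delete && "
                f"find {BASE_LOG_FOLDER} -type d -empty -delete"
            ),
        )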

When we implemented remote logging, we expected the local logs to be removed after they were shipped to S3, but this is not the case: local logs remain on the server. I thought this might be a problem with our configuration, but I haven't found any setting that changes it. Also, remote logging only applies to task logs; process logs (specifically scheduler logs) always stay local, and they take up the most space.
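
If we went the custom FileTaskHandler route instead, I imagine it would look roughly like the sketch below: subclass the stock S3 handler and remove the local file once close() has shipped it. This assumes Airflow 2.x with the Amazon provider installed; local_base and log_relative_path are internals of the stock handlers and may differ between versions.

    import os

    from airflow.providers.amazon.aws.log.s3_task_handler import S3TaskHandler

    class S3CleanupTaskHandler(S3TaskHandler):
        """Ship task logs to S3, then delete the local copy."""

        def close(self):
            super().close()  # flushes the handler and uploads the log to S3
            local_loc = os.path.join(self.local_base, self.log_relative_path)
            if self.log_relative_path and os.path.isfile(local_loc):
                os.remove(local_loc)

We would then point the logging_config_class setting at a logging config that swaps this class in for the stock handler, if I understand the mechanism correctly.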

We tried to implement maintenance DAGs, but our workers run in a different location from the rest of Airflow, in particular the scheduler, so only the task logs were getting cleaned. We could get around this by creating a new worker that shares logs with the scheduler, but we would prefer not to run extra workers.
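
For the scheduler logs specifically, the most direct workaround we can think of is a small cleanup script run from cron on the scheduler host itself, roughly like the sketch below (Python 3.8+; the log directory and the 14-day retention are placeholders for our setup):

    import time
    from pathlib import Path

    SCHEDULER_LOG_DIR = Path("/opt/airflow/logs/scheduler")  # placeholder path
    CUTOFF = time.time() - 14 * 24 * 3600  # keep two weeks of logs

    # Delete stale scheduler log files.
    for path in SCHEDULER_LOG_DIR.rglob("*.log"):
        if path.is_file() and not path.is_symlink() and path.stat().st_mtime < CUTOFF:
            path.unlink(missing_ok=True)

    # Prune per-day directories left empty, deepest first; the "latest"
    # symlink is skipped.
    for directory in sorted(SCHEDULER_LOG_DIR.rglob("*"), reverse=True):
        if directory.is_dir() and not directory.is_symlink() and not any(directory.iterdir()):
            directory.rmdir()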

We haven't tried either of the other two suggestions yet. That is why I want to understand: how are other people solving this, and what is the recommended way?

Gayle
