
I am a data engineer and work with Airflow regularly.

When redeploying DAGs with a new start date, the recommended best practice is as follows:

Don’t change start_date + interval: When a DAG has been run, the scheduler database contains instances of the run of that DAG. If you change the start_date or the interval and redeploy it, the scheduler may get confused because the intervals are different or the start_date is way back. The best way to deal with this is to change the version of the DAG as soon as you change the start_date or interval, i.e. my_dag_v1 and my_dag_v2. This way, historical information is also kept about the old version.

However, after deleting all previous DAG and task runs, I tried to redeploy a DAG with a new start date. It worked as expected (with the new start date) for a day, then went back to using the old one.

What are the reasons for this? In depth if you can.

y2k-shubham
scr
  • Airflow's `scheduler` is a [complex and mysterious](https://cwiki.apache.org/confluence/display/AIRFLOW/Scheduler+Basics) piece of engineering. It has its fair share of nuances, and [major changes](https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB) have been proposed. Until then, [people recommend](https://stackoverflow.com/a/49088553/3679900) restarting it periodically (which is good advice for pretty much every long-running process). At the very least, changing `start_date` or `schedule_interval` calls for an immediate restart (or you'll be in for some surprises). – y2k-shubham Jul 18 '19 at 17:08

1 Answer


Airflow keeps all of the information about past runs in the `dag_run` table.

When you clear the previous DAG runs, these entries are dropped from the database. Airflow then treats the DAG as a new one and starts it at the specified time.

To schedule the next run, Airflow takes the execution date of the last DAG run and adds the interval you specified in `schedule_interval` (for example, a timedelta object).
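To make that rule concrete, here is a minimal sketch of the idea in plain Python. It is a simplified model and not Airflow's actual scheduler code: the function name `next_execution_date` and the example dates are hypothetical, and details such as catchup or task-level start dates are ignored.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_execution_date(
    start_date: datetime,
    interval: timedelta,
    last_execution_date: Optional[datetime] = None,
) -> datetime:
    """Simplified model of picking the next run (hypothetical helper, not Airflow API)."""
    if last_execution_date is None:
        # No rows recorded for this DAG: it looks brand new, so begin at start_date.
        return start_date
    # Otherwise continue from the most recent recorded run.
    return last_execution_date + interval


new_start = datetime(2019, 7, 15, 4, 0)
daily = timedelta(days=1)

# History cleared: the new start_date is honoured.
fresh = next_execution_date(new_start, daily)

# A run left over from the old schedule: scheduling resumes from that run instead.
resumed = next_execution_date(
    new_start, daily, last_execution_date=datetime(2019, 7, 10, 4, 0)
)
```

In this model, any run that survives the cleanup is enough to pull scheduling back to the old timeline, which may be one reason a DAG appears to revert to its previous start date.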

If you still have difficulties after clearing the DAG runs, there are a few things you can try:

  1. Rename the DAG, as suggested.
  2. Clear all the DAG runs and keep the DAG paused. Create a DAG run manually, then turn the DAG on. It will run at the scheduled time afterwards.
  3. The best approach would be to use a cron expression in `schedule_interval`.
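Options 1 and 3 can be combined. The sketch below builds the keyword arguments for a DAG with a versioned `dag_id` and a cron schedule. The helper `versioned_dag_kwargs`, the name `my_dag`, and the dates are hypothetical; only `dag_id`, `start_date`, and `schedule_interval` are meant to match Airflow's `DAG` constructor arguments as they existed at the time of this answer.

```python
from datetime import datetime


def versioned_dag_kwargs(base_name, version, start_date, cron):
    """Build DAG keyword arguments with a version suffix (hypothetical helper)."""
    return {
        # A new dag_id means the scheduler has no stored runs for this DAG.
        "dag_id": "{}_v{}".format(base_name, version),
        # Keep start_date static; bump the version whenever it changes.
        "start_date": start_date,
        # Cron expression: every day at 04:00.
        "schedule_interval": cron,
    }


settings = versioned_dag_kwargs("my_dag", 2, datetime(2019, 7, 15), "0 4 * * *")
```

In a DAG file these settings would be passed on as `DAG(**settings)`. Bumping `version` whenever `start_date` or the schedule changes keeps the history of the old version intact under its old name.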
Nitin Pandey
  • I did 3: `SCHEDULE_INTERVAL = '0 4 * * *'`. I am purposely trying to avoid 1. Also, 2 makes no sense, since it worked the first day and then misbehaved. – scr Jul 19 '19 at 16:18