
I've read Airflow's FAQ about "What's the deal with start_date?", but it still isn't clear to me why using a dynamic start_date is discouraged.

To my understanding, a DAG's execution_date is determined by the minimum start_date among all of the DAG's tasks, and subsequent DAG Runs are run at the latest execution_date + schedule_interval.

If I set my DAG's default_args start_date to be for, say, yesterday at 20:00:00, with a schedule_interval of 1 day, how would that break or confuse the scheduler, if at all? If I understand correctly, the scheduler would trigger the DAG with an execution_date of yesterday at 20:00:00, and the next DAG Run would be scheduled for today at 20:00:00.

Is there some concept that I'm missing?

astronomotrous

2 Answers


The first run happens at start_date + schedule_interval. Airflow doesn't run the DAG at start_date itself; it always runs at start_date + schedule_interval.

As the documentation mentions, if you give a dynamic start_date, e.g. datetime.now(), together with some schedule_interval (say 1 hour), that run will never execute: now() moves along with time, so the current time never reaches datetime.now() + 1 hour.
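To make that concrete, here is a minimal sketch in plain Python (no Airflow import). It is a simplified model, not Airflow's actual scheduler code: it assumes the DAG file is re-evaluated on every scheduler pass, so a dynamic start_date is recomputed each time. The names `first_run_due`, `get_start_date`, and `clock_ticks` are made up for illustration.

```python
from datetime import datetime, timedelta


def first_run_due(get_start_date, interval, clock_ticks):
    """Return the first tick at which a run becomes due, or None.

    Simplified model (an assumption, not Airflow internals): on every
    tick the DAG definition is re-evaluated, and a run is due once
    now >= start_date + interval.
    """
    for now in clock_ticks:
        start_date = get_start_date(now)  # recomputed on every pass
        if now >= start_date + interval:
            return now
    return None


interval = timedelta(hours=1)
t0 = datetime(2016, 12, 19, 20, 0)
# Simulated scheduler passes: every 5 minutes for 4 hours.
ticks = [t0 + timedelta(minutes=5 * i) for i in range(49)]

# Static start_date: fixed at t0, so the first run is due at t0 + 1 hour.
static_result = first_run_due(lambda now: t0, interval, ticks)

# Dynamic start_date: behaves like datetime.now(), so it moves with the clock.
dynamic_result = first_run_due(lambda now: now, interval, ticks)

print("static :", static_result)
print("dynamic:", dynamic_result)
```

With the fixed start_date the condition is met one interval after `t0`; with the moving one, `now >= now + interval` can never be true, so no run ever becomes due.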

liferacer
  • so how often does the scheduler calculate `start_date`? does it calculate it before every run? – astronomotrous Dec 19 '16 at 23:22
  • I think I'm getting confused by this: if `start_date` is `datetime.now()` at time `t`, then `t` should have been saved somewhere, right? so when `t + 1` finally comes around, the scheduler should know to start the run, given that it doesn't calculate `start_date` again – astronomotrous Dec 19 '16 at 23:24
  • @earthican This is not clearly mentioned in the documentation. Another thing I would like to mention: if you make any changes to start_date or schedule_interval, always modify the name of the DAG, e.g. my_dag_v1 or something. Your changes to start_date or the interval will not work if you don't change the name of the DAG – liferacer Dec 21 '16 at 21:07
  • I accept that using `datetime.now()` is a bad idea, but what is the alternative when I'm writing a DAG that will be deployed at an undetermined time in the future? I have a scenario where I don't want to set `start_time: datetime.datetime(2019,4,29)` with a schedule of every 1 hour, because if I deploy it in 3 weeks' time it's going to run 21*24=504 times, and I don't want that to happen. – jamiet Apr 29 '19 at 13:59
  • @liferacer You say *Your changes with start_date or interval will not work if you don't change the name of dag*. Do you mean one needs to change the name of the .py file, or the value assigned to `dag_id`? – jamiet Apr 29 '19 at 14:01
  • @jamiet one option is to set catchup=False when creating the DAG, but this prevents catchup going forward too which may not be what you want. The only way around that I've seen is to set the start_date appropriately. Disclaimer: just getting started with Airflow. – totalhack Aug 13 '19 at 19:06
  • Setting catchup=False only takes care of previous backfills; it will still very often create at least one instance when you unpause the DAG after it has been paused for some time. This behaviour is good for ETL processes but not as intuitive for cron-style jobs. As an example, if I have something scheduled at 10 every day and I pause it for a few days, then unpause it today at 8, it is going to immediately trigger a run for at least the previous day, even with catchup=False. – raaj Apr 19 '21 at 12:25
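
The 21*24=504 figure from jamiet's comment can be checked with a short sketch. It assumes catchup is enabled and that one run is created per fully elapsed interval since start_date; the deploy date of exactly three weeks later and the helper name `completed_intervals` are hypothetical, chosen only to match the numbers in the comment.

```python
from datetime import datetime, timedelta


def completed_intervals(start_date, interval, now):
    """Count the schedule intervals that have fully elapsed since start_date.

    Assumption: with catchup enabled, one run is created per elapsed interval.
    """
    if now < start_date + interval:
        return 0
    return (now - start_date) // interval


start_date = datetime(2019, 4, 29)
deploy_time = start_date + timedelta(weeks=3)  # hypothetical deploy date

backlog = completed_intervals(start_date, timedelta(hours=1), deploy_time)
print(backlog)
```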

The scheduler expects to see a constant start date and interval. If you change either one, the scheduler might not notice until it reloads the DagBag, and if the new start date doesn't line up with your old schedule, it might break depends_on_past behavior.
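
As a rough illustration of "doesn't line up", the sketch below lists the execution dates implied by an old and a new start date with the same daily interval. It is plain Python under the assumption that execution dates fall at start_date + k * interval; the dates and the helper `execution_dates` are hypothetical, and it does not model how Airflow actually resolves depends_on_past.

```python
from datetime import datetime, timedelta


def execution_dates(start_date, interval, count):
    """Execution dates at start_date + k * interval (a simplifying assumption)."""
    return [start_date + interval * k for k in range(count)]


interval = timedelta(days=1)

# Old schedule: daily at 20:00, starting 2016-12-01 (hypothetical values).
old_dates = execution_dates(datetime(2016, 12, 1, 20, 0), interval, 6)

# New schedule after editing start_date: daily at 08:30 from 2016-12-05.
new_dates = execution_dates(datetime(2016, 12, 5, 8, 30), interval, 3)

# No execution date is shared, so a new run has no old run on the same
# grid to treat as its "previous" one.
overlap = set(old_dates) & set(new_dates)
print(sorted(overlap))
```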

If you don't need depends_on_past, the simplest option might be to stop using the scheduler, set the start date to some arbitrary old date, and externally trigger the DAG however you like, using a crontab or similar.

dgies