
I’m new to Airflow and I’m trying to understand how to use the scheduler correctly. Basically, I want to schedule tasks the same way I do with cron. There’s a task that needs to run every 5 minutes, and I want its first DAG run to start at the next even 5-minute slot after I add the DAG file to the dags directory or make changes to the DAG file.

I know that a DAG run is triggered at the end of its schedule_interval. If I add a new DAG and use start_date=days_ago(0), I get unnecessary runs starting from the beginning of the day. It also feels stupid to hardcode a specific start date in the DAG file, e.g. start_date=datetime(2019, 9, 4, 10, 1, 0, 818988). Is my approach wrong, or is there some specific reason why start_date needs to be set?
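
To make the "unnecessary runs" concrete, here is a rough sketch in plain Python. It is not Airflow's scheduler code; `pending_runs` is a hypothetical helper that only counts how many 5-minute intervals have fully closed since a given start date, which is roughly the number of runs a catch-up would have to create.

```python
from datetime import datetime, timedelta


def pending_runs(start_date, now, interval):
    """Count the schedule intervals that have fully closed between start_date and now.

    Simplified model (an assumption, not Airflow's implementation):
    one run becomes due each time a whole interval has elapsed.
    """
    if now <= start_date:
        return 0
    # timedelta // timedelta yields an int: the number of whole intervals elapsed
    return (now - start_date) // interval


# DAG file added at 10:01 with a start date of midnight the same day
midnight = datetime(2019, 9, 4, 0, 0)
added_at = datetime(2019, 9, 4, 10, 1)
print(pending_runs(midnight, added_at, timedelta(minutes=5)))
```

With these example values, every closed 5-minute slot since midnight counts as a run that is already due.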

MrBronson
    Why don't you just use `start_date: datetime.now() - timedelta(minutes=5)` or something similar? – absolutelydevastated Sep 05 '19 at 10:52
    **@absolutelydevastated**, the [docs](http://airflow.apache.org/faq.html#what-s-the-deal-with-start-date) warn against using `datetime.now()` or [*dynamic-start-date*](https://stackoverflow.com/q/41134524/3679900). Quoting the relevant line here `"..We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along..."` – y2k-shubham Sep 05 '19 at 15:09

1 Answer


I think I found an answer to my own question in the official documentation: https://airflow.apache.org/scheduler.html#backfill-and-catchup

With catchup turned off, a DAG run is created only for the most recent interval. That means I can set start_date to any fixed date in the past and define the DAG like this:

dag = DAG('good-dag', catchup=False, default_args=default_args, schedule_interval='*/5 * * * *')
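
As a rough illustration of what "the most recent interval" means for a `*/5 * * * *` schedule, here is a sketch in plain Python. It is a simplified model under my own assumptions, not Airflow's actual scheduling logic, and `latest_closed_interval` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def latest_closed_interval(now, minutes=5):
    """Return (start, end) of the most recent fully closed N-minute slot.

    Assumes slots are aligned to the hour, as with the cron
    expression */5 * * * * (minutes 0, 5, 10, ...).
    """
    # Round "now" down to the previous slot boundary
    end = now.replace(minute=now.minute - now.minute % minutes,
                      second=0, microsecond=0)
    start = end - timedelta(minutes=minutes)
    return start, end


# DAG file picked up at 10:01:00.818988, as in the question
start, end = latest_closed_interval(datetime(2019, 9, 4, 10, 1, 0, 818988))
print(start, end)
```

In this model the start_date only needs to lie somewhere before the returned slot; how far back it is makes no difference once older intervals are skipped.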

MrBronson
  • Did this work for you? The code comment says *(or only run latest)*, but in my test it scheduled the two most recent runs instead of just the latest one. – Vibhor Jain Dec 10 '19 at 04:32
  • Yes, it works like this for me. I tested it with a simple DAG whose start date was set far in the past and which was defined as in the answer. – MrBronson Dec 11 '19 at 12:21