
I have the following DAGs: (screenshot of the DAG list)

The first DAG, scheduled with `0 1 * * *`, ran without any problem. The second DAG, scheduled with `0 10 1 * *`, did not run. When I do:

import datetime
print(datetime.datetime.now())

I get:

2018-07-01 12:14:15.632812

So I don't understand why this DAG hasn't been scheduled. I understand that it's not mandatory for it to run exactly at 10:00, but its state should at least be Running.

According to the "Latest Run" of the first task, which is 2018-06-30 01:00, I suspect that I don't actually understand the Airflow clock. From my point of view the last run was at 2018-07-01 01:00, because it ran this morning, not yesterday.

Edit: I saw this paragraph in the documentation:

"Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended."

So I'm wondering: should I schedule everything one day before the date I actually want? If I want something to run at `0 10 1 * *`, should I schedule it as `0 10 30 * *`? In other words, if I want something to run on the 1st of each month at 10:00, should I schedule it for the last day of each month at 10:00?

Where is the logic in that? This is very hard to understand and follow.

It gets worse: according to this, there is no way to tell the scheduler this input. What am I to do?!

jack

1 Answer


Airflow schedules tasks to run at the END of a schedule interval. This can be a little counterintuitive, but it is based on the idea that the data for a particular interval isn't available until that interval is over.

Suppose you had a workflow that is supposed to run every day. You can't get all of the data for yesterday until that day is over (today).

In your case, it makes sense that the first DAG's latest run is for yesterday, since that was the "execution_date" associated with that DagRun - your DAG ran today for yesterday's data.

If you want your DAG to run on the 1st of every month, then changing the schedule isn't a bad idea. However, if you want your DAG to run for the data associated with the 1st of every month (i.e. pass that date into an API request or a SQL query), then you have it right.
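The interval semantics can be sketched in plain Python (a hypothetical illustration of the rule from the docs, not Airflow's actual scheduler code): for a monthly schedule like `0 10 1 * *`, the run stamped with an interval's start only fires once that interval has ended.

```python
from datetime import datetime

def trigger_time(interval_start: datetime) -> datetime:
    """For a monthly schedule such as '0 10 1 * *', the DagRun stamped
    with interval_start (its execution_date) is only triggered one
    interval later, once the period it covers has ended."""
    year, month = interval_start.year, interval_start.month
    if month == 12:
        year, month = year + 1, 1
    else:
        month += 1
    return interval_start.replace(year=year, month=month)

# The run "for" June 1st actually starts on July 1st at 10:00.
print(trigger_time(datetime(2018, 6, 1, 10, 0)))  # 2018-07-01 10:00:00
```

This is why, with a `start_date` in June and a monthly interval, nothing fires until the June interval closes on July 1st.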

Viraj Parekh
  • Sorry but this does not answer my question. I'm aware of Airflow's features. My code needs to run on the 1st day of every month at 10 AM. I do not pass any date argument; it's simply a function that needs to run. Airflow doesn't allow me to do that. There is no cron expression that can say "last day of a month" - you need to build 3 different cron expressions for that. – jack Jul 02 '18 at 05:21
  • @jack I think you're misunderstanding the "schedule interval". If your start date is `2018-06-01` and your schedule interval is `0 10 1 * *`, then at `2018-07-01T10:00:00` your execution for `2018-06-01T10:00:00` will start running. https://crontab.guru/#0_10_1_*_* – dlamblin Jul 02 '18 at 09:47
  • @dlamblin I'm not sure I understand. My start date is `'start_date': datetime(2018, 06, 21)`. This is the date I launched the DAG. – jack Jul 02 '18 at 11:11
  • So if you had your `start_date` in late June and have the interval set to run every month, it will run on the first day of July, as that's when the interval (monthly, starting in June) will have finished. – Viraj Parekh Jul 02 '18 at 19:28
  • I don't think that's a good decision on their part. It should be up to the developer to bear that in mind, and the start_date should literally start the DAG on the same day as the start_date value. They're adding hidden complexity. – Kyle Bridenstine Sep 04 '18 at 20:21
  • My Airflow DAG doesn't start when I specify seconds in start_date. Is it a Bug or Feature? – vs4vijay Jan 14 '21 at 14:15