77

Say you have an Airflow DAG that doesn't make sense to backfill: once it has run, running it again in quick succession for past intervals would be completely pointless.

For example, if you're loading data into your database from a source that is only updated hourly, backfill runs, which fire in rapid succession, would just import the same data again and again.

This is especially annoying when you instantiate a new hourly task and it runs N times, once for each hour it missed, doing redundant work before it starts running on the interval you specified.

The only solution I can think of is something the docs specifically advise against in the FAQ:

We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing.
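That workaround would look something like this (a rough sketch; the DAG id and interval here are placeholders):

from datetime import datetime, timedelta
from airflow import DAG

# Discouraged per the FAQ: the start date moves every time the scheduler
# parses the file, so there are never any past intervals to backfill.
dag = DAG('my_hourly_dag',
          start_date=datetime.now(),
          schedule_interval=timedelta(hours=1))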

Is there any way to disable backfilling for a DAG, or should I do the above?

m0meni
  • I am struggling with this behaviour: Airflow always kicks off a subsequent DAG run immediately after a manual run. I already have catchup=False; it looks like Airflow always triggers a scheduled run for the current interval, which I do not want. I want it to execute only the future interval. Does anyone have any idea? – thedevd Feb 21 '23 at 08:12

3 Answers

81

Upgrade to Airflow version 1.8 and set catchup_by_default=False in airflow.cfg, or apply catchup=False to each of your DAGs.

https://github.com/apache/incubator-airflow/blob/master/UPDATING.md#catchup_by_default
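For the config route, the setting lives in the [scheduler] section of airflow.cfg (a minimal sketch; the rest of the file stays unchanged):

[scheduler]
catchup_by_default = False

Note that this only changes the default for DAGs that don't set catchup themselves; an explicit catchup argument on the DAG takes precedence over the config value.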

sage88
  • Thanks. This is way better than the LatestOnlyOperator. – m0meni Mar 31 '17 at 13:12
  • 8
    I have set catchup_by_default=False, yet Airflow still backfills the jobs. Any idea why? I'm running version 1.8 – Ollie Glass Apr 19 '17 at 13:47
  • @OllieGlass Are you sure you applied it to the correct container? I don't know exactly what your setup is, but that definitely matters. You can also try applying it to specific DAGs if you're unsure. – sage88 May 11 '17 at 02:39
  • @OllieGlass - did you figure it out? I am getting the same non-behavior. – Nick Jun 26 '17 at 19:22
  • @sage88 - what do you mean by "correct container"? The airflow doc isn't exactly explicit on where to place the setting. – Nick Jun 26 '17 at 19:22
  • 4
    @Nick I actually wasn't able to get the default setting working either so I ended up just putting `catchup=False` on all my DAGs like `DAG('example', default_args=default_args, schedule_interval='0 5 * * *', catchup=False)` – m0meni Jun 26 '17 at 19:46
  • @AR7 - thanks for the tip - I'll give it a try. I had erroneously put "catchup" in the default args... maybe that *should* work, but it didn't for me. – Nick Jun 26 '17 at 21:42
  • 12
    @Nick the default args object consists of arguments applied to the **tasks** running under the DAG, **not to the DAG** itself. I was also initially confused by this. – m0meni Jun 26 '17 at 21:50
  • Also not seeing `catchup_by_default=False` working, and I'm running Airflow off the edge (i.e. master) – Mike Jan 20 '18 at 18:10
  • 1
    `catchup_by_default=False` in `default_args` also had no effect for me; however, setting `backfill=False` on the DAG worked... – Hans L Jan 21 '18 at 22:21
  • I'm going to have to doublecheck if we're using `backfill=False` because we had to resort to renaming the DAG on 1.8 when `catchup_by_default=False` didn't work. – szeitlin Mar 09 '18 at 01:06
  • The question should be how to prevent Airflow from catching up/backfilling a task on a DAG where catchup=True; it is obvious how to set catchup=False now. Maybe in an old version of Airflow this was not the case, but otherwise the question no longer makes sense. – mathtick Apr 16 '18 at 13:48
  • I have catchup_by_default=False in both airflow.cfg and default args, as well as catchup=False for the dag. They don't work at all. As soon as I unpause a new dag, it gets backfilled right away. – Sam Sep 14 '18 at 00:54
  • @Sam Please see m0meni's comment above. The argument must be applied to the DAG, not the tasks. As such it should not be included in the default args; it should be applied as a DAG argument. – sage88 Sep 15 '18 at 03:03
  • 1
    @sage88 Yep I read through all the comments to make sure. I set catchup=False for the Dag and have catchup_by_default=False in the airflow.cfg. But unpausing a Dag backfills right away in the UI. – Sam Sep 15 '18 at 08:32
  • 3
    I'm using Airflow v1.10.0 and am still seeing this issue – harveyxia Nov 15 '18 at 15:41
  • 4
    Same here, on Airflow 1.10.1. I am setting `catchup=False` on all dags, and I still get backfill. – flinz Nov 30 '18 at 15:57
  • I'm using Airflow 1.10.2 and setting backfill=False (applied to the DAG). Backfill still occurred. – sbraden Sep 23 '19 at 15:19
  • I'm using airflow 1.10.5. I am setting catchup=False, but Dag backfills again. Is this issue solved? – Bhanu Prakash Sep 15 '21 at 12:43
20

Setting catchup=False in your DAG declaration provides exactly this functionality.

I don't have the "reputation" to comment, but I wanted to say that catchup=False was designed (by me) for this exact purpose. I can also verify that in 1.10.1 it works when set explicitly in the DAG instantiation. However, I do not see it working when placed in the default args. I've been away from Airflow for 18 months though, so it will be a bit before I can take a look at why the default args aren't working for catchup.

from datetime import datetime, timedelta
from airflow import DAG

default_args = {'owner': 'airflow', 'start_date': datetime(2019, 1, 1)}  # placeholder values

dag = DAG('example_dag',
          max_active_runs=3,
          catchup=False,  # passed to the DAG itself, not via default_args
          schedule_interval=timedelta(minutes=5),
          default_args=default_args)
Ben Tallman
  • 2
    I'm running `airflow 1.10.14` and this does not work, at least not when using the DebugExecutor – cosbor11 Dec 17 '20 at 05:44
  • Running airflow 1.10.12 and it's still not working. – Prathamesh dhanawade Jan 20 '21 at 17:27
  • Just saw that by default `catchup_by_default` is a `String` set to True instead of a `Boolean`. Not sure if that's an issue! https://airflow.apache.org/docs/apache-airflow/1.10.12/configurations-ref.html#catchup-by-default Can we have this default to False, since so many people don't need it and have issues turning it off? – Prathamesh dhanawade Jan 20 '21 at 20:39
15

This appears to be an unsolved Airflow problem. I would really like to have exactly this feature myself. Here is as far as I've gotten; it may be useful to others.

There are UI features (at least in 1.7.1.3) which can help with this problem. If you go to the Tree view and click on a specific task (the square boxes), a dialog will come up with a 'mark success' button. Clicking 'past', then clicking 'mark success', will label all the instances of that task in the DAG as successful, and they will not be run. The top-level DAG (the circles on top) can also be labeled as successful in a similar fashion, but there doesn't appear to be a way to label multiple DAG instances.

I haven't looked into it deeply enough yet, but it may be possible to use the 'trigger_dag' subcommand to mark the states of DAGs. See here: https://github.com/apache/incubator-airflow/pull/644/commits/4d30d4d79f1a18b071b585500474248e5f46d67d

A CLI feature to mark DAGs is in the works: http://mail-archives.apache.org/mod_mbox/airflow-commits/201606.mbox/%3CJIRA.12973462.1464369259000.37918.1465189859133@Atlassian.JIRA%3E https://github.com/apache/incubator-airflow/pull/1590

UPDATE (9/28/2016): A new operator, 'LatestOnlyOperator', has been added (https://github.com/apache/incubator-airflow/pull/1752) which will only run the latest version of downstream tasks. It sounds very useful, and hopefully it will make it into a release soon.

UPDATE 2: As of airflow 1.8, the LatestOnlyOperator has been released.
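Here is a minimal sketch of how it slots into a DAG (the task names are made up, and `dag` is assumed to be an existing DAG object; on any run that is not the most recent scheduled interval, everything downstream of the latest-only task is skipped):

from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.operators.dummy_operator import DummyOperator

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
load = DummyOperator(task_id='load', dag=dag)  # stand-in for the real work

# Catch-up/backfill runs stop at latest_only; only the newest run proceeds.
latest_only >> load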

Ziggy Eunicien
  • Update looks really promising! Thanks for keeping up with the question. – m0meni Sep 30 '16 at 13:14
  • 2
    Note that LatestOnlyOperator sets the downstream tasks to a 'skipped' state, and per the docs skipped states propagate, so tasks whose directly upstream tasks are all skipped are skipped as well. This makes the approach unsuitable when you'd (ew) like the upstream jobs to run successfully with out-of-date data. In that case the best solution is to add an early operator in your code that escapes to success if the task is being run particularly late. – russellpierce Jan 20 '17 at 11:52
  • 1
    The backfill command in the CLI now looks to be available and is probably the best way to do this for now. https://airflow.incubator.apache.org/cli.html For example: airflow backfill [dag_id] -m -s [start_date] -e $(date +"%Y-%m-%dT%H:%M:%S") – sage88 Mar 30 '17 at 15:41
  • I have tried the backfill mark success script trick, and it doesn't actually work to stop all running tasks/prevent backfilling (at least in 1.8). Hopefully it will work in future versions. Doing it manually through the UI works, but that's really only feasible if you're dealing with small numbers of backfilling tasks. – szeitlin Mar 09 '18 at 01:07