
I'm fairly new to Airflow, and I'm having this problem: I have a DAG that processes txt files and converts them to csv. This is the configuration:

from datetime import datetime, timedelta

one_days_ago = datetime.combine(datetime.today() - timedelta(1),
                                datetime.min.time())

default_args = {
    'owner': 'airflow',
    'depends_on_past': False, 
    'start_date': one_days_ago,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'max_active_runs':1,
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG('process_file', default_args=default_args, schedule_interval='@daily')

The problem is that when the DAG runs, it processes the file from the current day, but it also produces results for previous runs, so instead of a single csv file for today I get that one plus another 4 or 5 files from previous days. I have read about backfill, but I'm not sure how to avoid it or what I'm doing wrong. Any suggestions? Is it possible to clear successful results from previous executions?

AnaF
    looks similar to: http://stackoverflow.com/questions/38751872/how-to-prevent-airflow-from-backfilling-dag-runs – Ziggy Eunicien Nov 14 '16 at 22:41
  • Thanks Ziggy! I have followed one of the advices from that post: _If you go to the Tree view and click on a specific task (square boxes), a dialog button will come up with a 'mark success' button. Clicking 'past', then clicking 'mark success' will label all the instances of that task in DAG as successful and they will not be run._ and works perfectly! :) – AnaF Nov 17 '16 at 13:25
  • Possible duplicate of [How to prevent airflow from backfilling dag runs?](https://stackoverflow.com/questions/38751872/how-to-prevent-airflow-from-backfilling-dag-runs) – Guille Dec 12 '17 at 14:06

1 Answer


Airflow doesn't like it when the start_date for a DAG changes. In fact, the latest releases (1.8+) will throw an exception if your start_date is not explicit. I'd imagine it won't rerun everything if you keep the start_date fixed.
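For illustration, a minimal sketch of the same DAG with a fixed, explicit start_date; the `catchup=False` flag (available as a DAG parameter from Airflow 1.8) additionally tells the scheduler not to backfill past intervals. The specific date here is an assumption, pick any fixed date in the past:

```python
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'process_file',
    default_args=default_args,
    # A fixed literal, not "today minus one day": Airflow treats
    # start_date as part of the DAG's identity, so it must not move.
    start_date=datetime(2016, 11, 1),
    schedule_interval='@daily',
    catchup=False,      # Airflow 1.8+: don't backfill past schedule intervals
    max_active_runs=1,  # a DAG-level argument, not a default_args entry
)
```

Note that `max_active_runs` belongs on the DAG itself; placing it in `default_args` (as in the question) has no effect, since it isn't a task argument.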

Nick
  • In fact - at least with 1.7.x - start_date is pretty much ignored once a DAG has been run once. The scheduler takes over at that point and considers the start_date to be the execution date of the most recent dag run. There is actually no point at all in having a dynamic start date. If you need to reset it, you can either delete all previous dag runs or rename the dag. In 1.8.x you have the ability to explicitly prevent backfill, which is the best option. The effect is to bring the start date = max(start date, now - interval), although the schedule will also affect that. – David Nugent Jul 13 '17 at 23:41