I have searched quite a lot for this but could not find any substantial information. My problem is that I have a DAG that should backfill data starting from March 2017.

So I have set start_date: '01-03-2017' and schedule_interval to daily. I know that my DAG will start running from March 2017 on that schedule. But if the DAG only advances one run per day, it will take more than 2 years to reach the current date.

I cannot wait 2 years to get the past data. I want my DAG to complete the backfill as soon as possible, so that it catches up to the current date and then starts running every day. How can I achieve this? Can I set max_active_runs to a high number so that several DAG runs execute at the same time?

tank
  • 2
    This [thread](https://stackoverflow.com/questions/38200666/airflow-parallelism) talks about all the possible ways to achieve concurrency/parallelism with airflow DAGs & Tasks – hmanolov Aug 25 '19 at 18:07
  • 1
    I'm pretty sure the backfill jobs don't run "daily". The runs from the day the DAG is triggered onward will execute on a daily basis, but the ones that were supposed to run before that will be executed whenever there are spare resources. If your DAG takes about 15 mins to complete (probably shorter), you can backfill 4 * 24 - 1 jobs in a day, assuming they run consecutively. You'll probably finish your backfilling in less than 2 weeks. – absolutelydevastated Aug 26 '19 at 11:17
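
The estimate in the comment above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `backfill_estimate`, the 15-minute run time, and the reference date of 26 Aug 2019 are assumptions for illustration, and scheduler overhead is ignored.

```python
from datetime import date
import math


def backfill_estimate(start, today, run_minutes, parallel_runs=1):
    """Return (daily runs to backfill, wall-clock days needed).

    Hypothetical helper: assumes one DAG run per elapsed day and that
    every run takes exactly `run_minutes` with no scheduling overhead.
    """
    runs = (today - start).days
    # How many runs fit into 24 hours with `parallel_runs` running at once.
    runs_per_day = parallel_runs * (24 * 60) // run_minutes
    return runs, math.ceil(runs / runs_per_day)


# One run at a time, 15 minutes each.
runs, days = backfill_estimate(date(2017, 3, 1), date(2019, 8, 26), run_minutes=15)
print(runs, days)  # 908 10

# Four runs at a time shortens the wall-clock time further.
print(backfill_estimate(date(2017, 3, 1), date(2019, 8, 26), 15, parallel_runs=4))  # (908, 3)
```

Under these assumptions, about 900 daily runs finish in roughly ten days even when they run one after another, which is consistent with the "less than 2 weeks" figure.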

1 Answer

During a backfill, your DAG does not wait for the schedule. It creates one daily run for each past date, and those runs execute concurrently, subject to the configured concurrency limits, until the backfill is complete. Only the execution date of each run is in the past. Once the DAG reaches the current date, it continues according to the schedule.
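
The settings that the answer above relies on might look roughly like this. It is a sketch only, assuming the Airflow 1.10-era `DAG` constructor arguments `start_date`, `schedule_interval`, `catchup`, and `max_active_runs`; the `dag_id` is a hypothetical name and the value 16 is arbitrary. The start date is written as a `datetime` object so that a string like '01-03-2017' cannot be misread as 3 January.

```python
from datetime import datetime

# Keyword arguments intended for airflow.DAG (assumed Airflow 1.10-style names).
dag_kwargs = {
    "dag_id": "backfill_from_march_2017",  # hypothetical DAG name
    "start_date": datetime(2017, 3, 1),    # 1 March 2017, unambiguous
    "schedule_interval": "@daily",
    "catchup": True,        # create a run for every missed daily interval
    "max_active_runs": 16,  # arbitrary cap on concurrently active DAG runs
}

# In a real DAG file, with Airflow installed, this would be used as:
#   from airflow import DAG
#   dag = DAG(**dag_kwargs)
print(dag_kwargs["start_date"].date(), dag_kwargs["max_active_runs"])
```

How many runs actually execute in parallel also depends on the executor and on the worker resources available, so a high `max_active_runs` alone does not guarantee that level of parallelism.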

Saurav Ganguli