I have two DAGs that need to run at different frequencies: dag1 needs to run weekly and dag2 needs to run daily. dag2 should only run after dag1 has finished, on every occasion that dag1 runs.
I have defined the two DAGs as follows, in two separate Python modules.
dag1.py
import datetime as dt
from os import path

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

PROJECT_PATH = path.abspath(path.join(path.dirname(__file__), '../..'))

with DAG('dag1',
         default_args={
             'owner': 'airflow',
             'start_date': dt.datetime(2019, 8, 19, 9, 30, 0),
             'concurrency': 1,
             'retries': 0
         },
         schedule_interval='00 10 * * 1',
         catchup=True
         ) as dag:
    CRAWL_PARAMS = BashOperator(
        task_id='crawl_params',
        bash_command='cd {}/scraper && scrapy crawl crawl_params'.format(PROJECT_PATH)
    )
dag2.py
import datetime as dt
from os import path

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

PROJECT_PATH = path.abspath(path.join(path.dirname(__file__), '../..'))

with DAG('dag2',
         default_args={
             'owner': 'airflow',
             'start_date': dt.datetime(2019, 8, 25, 9, 30, 0),
             'concurrency': 1,
             'retries': 0
         },
         schedule_interval='5 10 * * *',
         catchup=True
         ) as dag:
    CRAWL_DATASET = BashOperator(
        task_id='crawl_dataset',
        bash_command='''
            cd {}/scraper && scrapy crawl crawl_dataset
        '''.format(PROJECT_PATH)
    )
Currently I have manually set a gap of 5 minutes between the two DAGs. This setup is not working, and it also lacks any way of making dag2 dependent on dag1 as required.
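For reference, the closest mechanism I have found so far is Airflow's ExternalTaskSensor, which, as I understand it, needs an execution_delta equal to the offset between the two schedules. Below is a minimal stdlib sketch of that offset for my 10:00 / 10:05 timings, with the sensor wiring I imagine shown in comments (the ids are from my DAGs above, but I have not verified that this actually works):

```python
import datetime as dt

# dag1 is scheduled at 10:00 on Mondays and dag2 at 10:05 daily, so on a
# Monday the dag1 run that dag2 should wait for started 5 minutes earlier.
dag2_run = dt.datetime(2019, 8, 26, 10, 5)   # a Monday run of dag2
dag1_run = dt.datetime(2019, 8, 26, 10, 0)   # the matching dag1 run
execution_delta = dag2_run - dag1_run
print(execution_delta)  # 0:05:00

# Sensor wiring I imagine inside dag2 (unverified):
# from airflow.sensors.external_task_sensor import ExternalTaskSensor
# wait_for_dag1 = ExternalTaskSensor(
#     task_id='wait_for_dag1',
#     external_dag_id='dag1',
#     external_task_id='crawl_params',
#     execution_delta=execution_delta,
# )
# wait_for_dag1 >> CRAWL_DATASET
```

But even if this is the right direction, I do not see how to make the sensor apply only on Mondays while dag2 keeps running normally on other days.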
I checked the answers here and here but was not able to figure it out.
NOTE: the schedule_intervals are indicative only. The intention is to run dag1 every Monday at a fixed time and to run dag2 daily at a fixed time; on Mondays, dag2 should run only after dag1 finishes.
Each DAG has multiple tasks as well.