A great answer
Too much? Anyway.
A lot of the other answers are a bit square-peg-round-hole: adding complicated new operators, abusing built-in variables, or somewhat failing to answer the question. I wasn't particularly happy with any of them, as they either hide their behaviour when viewed through the web UI, are prone to breaking, or require a lot of custom code (which is also prone to breaking).
This solution uses built-in functionality, requires no new operators and only limited additional code, keeps the DAGs visible through the UI without any tricks, and follows Airflow best practice (see idempotency).
The solution to this problem is fairly complicated, so I've split it into several parts. These are:
- How to safely trigger a dynamic number of tasks
- How to wait for all of these tasks to finish then call a final task
- How to integrate this into your task pipeline
- Limitations (nothing is perfect)
Can a task trigger a dynamic number of other tasks?
Yes. Sort of. Without needing to write any new operators, it's possible to have a DAG trigger a dynamic number of other DAGs, using only built-in operators. This can then be expanded to have a DAG depend on a dynamic number of other DAGs (see waiting for tasks to finish). This is similar to flinz's solution, but more robust and with much less custom code.
This is done using a BranchPythonOperator that selectively triggers two other TriggerDagRunOperators. One of these recursively re-triggers the current DAG; the other triggers an external DAG, the target.
An example config that can be used to trigger the DAG is given at the top of recursive_dag.py.
print_conf.py (an example DAG to trigger)
from datetime import timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def print_output(dag_run):
    dag_conf = dag_run.conf
    if 'output' in dag_conf:
        output = dag_conf['output']
    else:
        output = 'no output found'
    print(output)


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'print_output',
    start_date=days_ago(2),
    tags=['my_test'],
    default_args=default_args,
    description='A simple test DAG',
    schedule_interval=None
) as dag:
    print_output = PythonOperator(
        task_id='print_output_task',
        python_callable=print_output
    )
recursive_dag.py (where the magic happens)
"""
DAG that can be used to trigger multiple other dags.
For example, trigger with the following config:
{
"task_list": ["print_output","print_output"],
"conf_list": [
{
"output": "Hello"
},
{
"output": "world!"
}
]
}
"""
from datetime import timedelta
import json
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.dates import days_ago
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'retries': 0,
'retry_delay': timedelta(minutes=5),
}
dag_id = 'branch_recursive'
branch_id = 'branch_operator'
repeat_task_id = 'repeat_dag_operator'
repeat_task_conf = repeat_task_id + '_conf'
next_task_id = 'next_dag_operator'
next_task_conf = next_task_id + '_conf'
def choose_branch(task_instance, dag_run):
dag_conf = dag_run.conf
task_list = dag_conf['task_list']
next_task = task_list[0]
later_tasks = task_list[1:]
conf_list = dag_conf['conf_list']
# dump to string because value is stringified into
# template string, is then parsed.
next_conf = json.dumps(conf_list[0])
later_confs = conf_list[1:]
task_instance.xcom_push(key=next_task_id, value=next_task)
task_instance.xcom_push(key=next_task_conf, value=next_conf)
if later_tasks:
repeat_conf = json.dumps({
'task_list': later_tasks,
'conf_list': later_confs
})
task_instance.xcom_push(key=repeat_task_conf, value=repeat_conf)
return [next_task_id, repeat_task_id]
return next_task_id
def add_braces(in_string):
return '{{' + in_string + '}}'
def make_templated_pull(key):
pull = f'ti.xcom_pull(key=\'{key}\', task_ids=\'{branch_id}\')'
return add_braces(pull)
with DAG(
dag_id,
start_date=days_ago(2),
tags=['my_test'],
default_args=default_args,
description='A simple test DAG',
schedule_interval=None
) as dag:
branch = BranchPythonOperator(
task_id=branch_id,
python_callable=choose_branch
)
trigger_next = TriggerDagRunOperator(
task_id=next_task_id,
trigger_dag_id=make_templated_pull(next_task_id),
conf=make_templated_pull(next_task_conf)
)
trigger_repeat = TriggerDagRunOperator(
task_id=repeat_task_id,
trigger_dag_id=dag_id,
conf=make_templated_pull(repeat_task_conf)
)
branch >> [trigger_next, trigger_repeat]
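To make the templating trick concrete: make_templated_pull just builds a Jinja expression string, which the TriggerDagRunOperator renders at run time (trigger_dag_id and conf are templated fields), pulling the value that the branch task pushed to XCom. A minimal illustration of what it produces:

# A minimal sketch of what make_templated_pull returns (not a new file):
# a Jinja template string, rendered by the operator when the task runs.
print(make_templated_pull(next_task_id))
# {{ti.xcom_pull(key='next_dag_operator', task_ids='branch_operator')}}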
This solution has the advantage of using very limited custom code. flinz's solution can fail part way through, leaving some tasks scheduled and others not; on retry, DAGs may then either be scheduled to run twice, or the run may fail on the first DAG, leaving partially complete work from a failed task. This approach tells you which DAGs failed to trigger, and retries only those. It is therefore idempotent; the other approach isn't.
Can a DAG depend on a dynamic number of other DAGs?
Yes, but... This can be easily done if tasks don't run in parallel. Running in parallel is more complicated.
To run in sequence, the important changes are: using wait_for_completion=True in trigger_next, using a PythonOperator to set up the XCom values before trigger_next, and adding a branch operator that either enables or disables the repeat task, giving the linear dependency
setup_xcom >> trigger_next >> branch >> trigger_repeat
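A minimal sketch of this sequential variant (not a complete file: setup_xcom_values and choose_repeat_branch are hypothetical callables you would write, and make_templated_pull is the two-argument (key, task_id) version from common.py used below):

from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# setup_xcom_values should push the same XCom keys that choose_branch
# pushes in recursive_dag.py; choose_repeat_branch should return
# repeat_task_id while tasks remain, or the id of a do-nothing task
# otherwise.
setup_xcom = PythonOperator(
    task_id='setup_xcom',
    python_callable=setup_xcom_values
)
trigger_next = TriggerDagRunOperator(
    task_id=next_task_id,
    trigger_dag_id=make_templated_pull(next_task_id, 'setup_xcom'),
    conf=make_templated_pull(next_task_conf, 'setup_xcom'),
    wait_for_completion=True,  # block until the triggered DAG run finishes
    poke_interval=30           # seconds between completion checks
)
branch = BranchPythonOperator(
    task_id='repeat_branch',
    python_callable=choose_repeat_branch
)
trigger_repeat = TriggerDagRunOperator(
    task_id=repeat_task_id,
    trigger_dag_id=dag_id,
    conf=make_templated_pull(repeat_task_conf, 'setup_xcom')
)
setup_xcom >> trigger_next >> branch >> trigger_repeat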
To run in parallel, you can similarly recursively chain several ExternalTaskSensors that use templated external_dag_id values, together with the timestamps associated with the triggered DAG runs. To get a triggered DAG's timestamp, you can trigger each DAG using the timestamp of the triggering DAG run. These sensors then wait, one by one, for all of the created DAGs to complete, before triggering a final DAG. Code below; this time I've added a random sleep to the print_output DAG so that the wait DAGs actually do some waiting.
Note: recurse_wait_dag.py now defines two DAGs; both need to be enabled for this to work.
An example config that can be used to trigger the DAG is given at the top of recurse_wait_dag.py.
print_conf.py (modified to add a random sleep)
"""
Simple dag that prints the output in DAG config
Used to demo TriggerDagRunOperator (see recursive_dag.py)
"""
from datetime import timedelta
from time import sleep
from random import randint
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
def print_output(dag_run):
sleep_time = randint(15,30)
print(f'sleeping for time: {sleep_time}')
sleep(sleep_time)
dag_conf = dag_run.conf
if 'output' in dag_conf:
output = dag_conf['output']
else:
output = 'no output found'
print(output)
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'retries': 0,
'retry_delay': timedelta(minutes=5),
}
with DAG(
'print_output',
start_date=days_ago(2),
tags=['my_test'],
default_args=default_args,
description='A simple test DAG',
schedule_interval=None
) as dag:
print_output = PythonOperator(
task_id='print_output_task',
python_callable=print_output
)
recurse_wait_dag.py (where even more magic happens)
"""
DAG that can be used to trigger multiple other dags,
waits for all dags to execute, then triggers a final dag.
For example, trigger the DAG 'recurse_then_wait' with the following config:
{
"final_task": "print_output",
"task_list": ["print_output","print_output"],
"conf_list": [
{
"output": "Hello"
},
{
"output": "world!"
}
]
}
"""
from datetime import timedelta
import json
from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.dates import days_ago
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils import timezone
from common import make_templated_pull
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'retries': 0,
'retry_delay': timedelta(minutes=5),
}
def to_conf(id):
return f'{id}_conf'
def to_execution_date(id):
return f'{id}_execution_date'
def to_ts(id):
return f'{id}_ts'
recurse_dag_id = 'recurse_then_wait'
branch_id = 'recursive_branch'
repeat_task_id = 'repeat_dag_operator'
repeat_task_conf = to_conf(repeat_task_id)
next_task_id = 'next_dag_operator'
next_task_conf = to_conf(next_task_id)
next_task_execution_date = to_execution_date(next_task_id)
end_task_id = 'end_task'
end_task_conf = to_conf(end_task_id)
wait_dag_id = 'wait_after_recurse'
choose_wait_id = 'choose_wait'
next_wait_id = 'next_wait'
next_wait_ts = to_ts(next_wait_id)
def choose_branch(task_instance, dag_run, ts):
dag_conf = dag_run.conf
task_list = dag_conf['task_list']
next_task = task_list[0]
# can't have multiple dag runs of same DAG with same timestamp
assert next_task != recurse_dag_id
later_tasks = task_list[1:]
conf_list = dag_conf['conf_list']
next_conf = json.dumps(conf_list[0])
later_confs = conf_list[1:]
triggered_tasks = dag_conf.get('triggered_tasks', []) + [(next_task, ts)]
task_instance.xcom_push(key=next_task_id, value=next_task)
task_instance.xcom_push(key=next_task_conf, value=next_conf)
task_instance.xcom_push(key=next_task_execution_date, value=ts)
if later_tasks:
repeat_conf = json.dumps({
'task_list': later_tasks,
'conf_list': later_confs,
'triggered_tasks': triggered_tasks,
'final_task': dag_conf['final_task']
})
task_instance.xcom_push(key=repeat_task_conf, value=repeat_conf)
return [next_task_id, repeat_task_id]
end_conf = json.dumps({
'tasks_to_wait': triggered_tasks,
'final_task': dag_conf['final_task']
})
task_instance.xcom_push(key=end_task_conf, value=end_conf)
return [next_task_id, end_task_id]
def choose_wait_target(task_instance, dag_run):
dag_conf = dag_run.conf
tasks_to_wait = dag_conf['tasks_to_wait']
next_task, next_ts = tasks_to_wait[0]
later_tasks = tasks_to_wait[1:]
task_instance.xcom_push(key=next_wait_id, value=next_task)
task_instance.xcom_push(key=next_wait_ts, value=next_ts)
if later_tasks:
repeat_conf = json.dumps({
'tasks_to_wait': later_tasks,
'final_task': dag_conf['final_task']
})
task_instance.xcom_push(key=repeat_task_conf, value=repeat_conf)
def execution_date_fn(_, task_instance):
date_str = task_instance.xcom_pull(key=next_wait_ts, task_ids=choose_wait_id)
return timezone.parse(date_str)
def choose_wait_branch(task_instance, dag_run):
dag_conf = dag_run.conf
tasks_to_wait = dag_conf['tasks_to_wait']
if len(tasks_to_wait) == 1:
return end_task_id
return repeat_task_id
with DAG(
recurse_dag_id,
start_date=days_ago(2),
tags=['my_test'],
default_args=default_args,
description='A simple test DAG',
schedule_interval=None
) as recursive_dag:
branch = BranchPythonOperator(
task_id=branch_id,
python_callable=choose_branch
)
trigger_next = TriggerDagRunOperator(
task_id=next_task_id,
trigger_dag_id=make_templated_pull(next_task_id, branch_id),
execution_date=make_templated_pull(next_task_execution_date, branch_id),
conf=make_templated_pull(next_task_conf, branch_id)
)
trigger_repeat = TriggerDagRunOperator(
task_id=repeat_task_id,
trigger_dag_id=recurse_dag_id,
conf=make_templated_pull(repeat_task_conf, branch_id)
)
trigger_end = TriggerDagRunOperator(
task_id=end_task_id,
trigger_dag_id=wait_dag_id,
conf=make_templated_pull(end_task_conf, branch_id)
)
branch >> [trigger_next, trigger_repeat, trigger_end]
with DAG(
wait_dag_id,
start_date=days_ago(2),
tags=['my_test'],
default_args=default_args,
description='A simple test DAG',
schedule_interval=None
) as wait_dag:
py_operator = PythonOperator(
task_id=choose_wait_id,
python_callable=choose_wait_target
)
sensor = ExternalTaskSensor(
task_id='do_wait',
external_dag_id=make_templated_pull(next_wait_id, choose_wait_id),
execution_date_fn=execution_date_fn
)
branch = BranchPythonOperator(
task_id=branch_id,
python_callable=choose_wait_branch
)
trigger_repeat = TriggerDagRunOperator(
task_id=repeat_task_id,
trigger_dag_id=wait_dag_id,
conf=make_templated_pull(repeat_task_conf, choose_wait_id)
)
trigger_end = TriggerDagRunOperator(
task_id=end_task_id,
trigger_dag_id='{{ dag_run.conf[\'final_task\'] }}'
)
py_operator >> sensor >> branch >> [trigger_repeat, trigger_end]
Integrating with your code
That's great, but you want to actually use this. So, what do you need to do? The question includes an example trying to do the following:
             |---> Task B.1 --|
             |---> Task B.2 --|
Task A ------|---> Task B.3 --|-----> Task C
             |      ....      |
             |---> Task B.N --|
To achieve the question's goal (example implementation below), you need to separate tasks A, B and C into their own DAGs. Then, in DAG A, add a new operator at the end that triggers the above DAG 'recurse_then_wait'. Pass this DAG a config that includes the config needed for each B DAG, as well as the B DAG id (this can easily be changed to use different DAGs, go nuts). Then include the name of DAG C, the final DAG to be run at the end. The config should look like this:
{
    "final_task": "C_DAG",
    "task_list": ["B_DAG","B_DAG"],
    "conf_list": [
        {
            "b_number": 1,
            "more_stuff": "goes_here"
        },
        {
            "b_number": 2,
            "foo": "bar"
        }
    ]
}
When implemented it should look something like this:
trigger_recurse.py
from datetime import timedelta
import json

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.dates import days_ago

from recurse_wait_dag import recurse_dag_id


def add_braces(in_string):
    return '{{' + in_string + '}}'


def make_templated_pull(key, task_id):
    pull = f'ti.xcom_pull(key=\'{key}\', task_ids=\'{task_id}\')'
    return add_braces(pull)


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

setup_trigger_conf_id = 'setup_trigger_conf'
trigger_conf_key = 'trigger_conf'


def setup_trigger_conf(task_instance):
    trigger_conf = {
        'final_task': 'print_output',
        'task_list': ['print_output', 'print_output'],
        'conf_list': [
            {
                'output': 'Hello'
            },
            {
                'output': 'world!'
            }
        ]
    }

    print('Triggering the following tasks')
    for task, conf in zip(trigger_conf['task_list'], trigger_conf['conf_list']):
        print(f'    task: {task} with config {json.dumps(conf)}')
    print(f'then waiting for completion before triggering {trigger_conf["final_task"]}')

    task_instance.xcom_push(key=trigger_conf_key, value=json.dumps(trigger_conf))


with DAG(
    'trigger_recurse_example',
    start_date=days_ago(2),
    tags=['my_test'],
    default_args=default_args,
    description='A simple test DAG',
    schedule_interval=None
) as dag:
    py_operator = PythonOperator(
        task_id=setup_trigger_conf_id,
        python_callable=setup_trigger_conf
    )

    trigger_operator = TriggerDagRunOperator(
        task_id='trigger_call_and_wait',
        trigger_dag_id=recurse_dag_id,
        conf=make_templated_pull(trigger_conf_key, setup_trigger_conf_id)
    )

    py_operator >> trigger_operator
All of this ends up looking something like the below, with vertical and horizontal lines to show where a DAG triggers another DAG:
A
|
Recurse - B.1
|
Recurse - B.2
|
...
|
Recurse - B.N
|
Wait for B.1
|
Wait for B.2
|
...
|
Wait for B.N
|
C
Limitations
Tasks are no longer visible on a single graph. This is probably the biggest problem with this approach. By adding tags to all associated DAGs, they can at least be viewed together, but relating multiple parallel runs of DAG B to runs of DAG A is messy. However, since a single DAG run shows its input conf, each DAG B run depends only on its input config rather than on DAG A itself, so this relation can be at least partially ignored.
Tasks can no longer communicate using XCom. The B tasks can receive input from task A via the DAG config; however, task C can't get output from the B tasks. The results of all the B tasks should be written to a known location and then read by task C.
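For example, a hypothetical sketch (the /shared/results path and the b_number key are assumptions, not part of the DAGs above): each B task writes its result to a shared location derived from its conf, and task C globs them back up:

import glob
import json

def b_task(dag_run):
    # B computes something from its conf and writes it to a shared location
    result = {'b_number': dag_run.conf['b_number'], 'status': 'done'}
    with open(f"/shared/results/{dag_run.conf['b_number']}.json", 'w') as f:
        json.dump(result, f)

def c_task():
    # C gathers all the B results back up from the shared location
    results = []
    for path in glob.glob('/shared/results/*.json'):
        with open(path) as f:
            results.append(json.load(f))
    print(f'collected {len(results)} results')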
The config argument to 'recurse_then_wait' could maybe be improved to combine task_list and conf_list, but this solves the problem as stated.
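One possible combined shape (a suggestion only, not implemented above):

# Hypothetical combined config: one list of (dag_id, conf) pairs
# instead of the two parallel lists used above.
trigger_conf = {
    'final_task': 'C_DAG',
    'tasks': [
        {'dag_id': 'B_DAG', 'conf': {'b_number': 1}},
        {'dag_id': 'B_DAG', 'conf': {'b_number': 2}}
    ]
}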
There's no config for the final DAG. That should be trivial to solve.
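For example (hypothetical: 'final_conf' is a new key you would add to the triggering config, thread through repeat_conf, and have choose_wait_target push to XCom JSON-dumped, like the other confs):

# Hypothetical sketch: pass a conf to the final DAG in wait_after_recurse.
trigger_end = TriggerDagRunOperator(
    task_id=end_task_id,
    trigger_dag_id='{{ dag_run.conf[\'final_task\'] }}',
    conf=make_templated_pull(to_conf('final'), choose_wait_id)
)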