How often is a dag definition file read during a single dag run?

I have a large DAG that takes a long time to build (~1-3 min). Looking at the logs of each task as the DAG runs, it appears that the DAG definition file is executed for every task before it runs...

*** Reading local file: /home/airflow/airflow/logs/mydag/mytask/2020-01-30T04:51:34.621883+00:00/1.log
[2020-01-29 19:02:10,844] {taskinstance.py:655} INFO - Dependencies all met for <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [queued]>
[2020-01-29 19:02:10,866] {taskinstance.py:655} INFO - Dependencies all met for <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [queued]>
[2020-01-29 19:02:10,866] {taskinstance.py:866} INFO - 
--------------------------------------------------------------------------------
[2020-01-29 19:02:10,866] {taskinstance.py:867} INFO - Starting attempt 1 of 1
[2020-01-29 19:02:10,866] {taskinstance.py:868} INFO - 
--------------------------------------------------------------------------------
[2020-01-29 19:02:10,883] {taskinstance.py:887} INFO - Executing <Task(BashOperator): precheck_db_perms> on 2020-01-30T04:51:34.621883+00:00
[2020-01-29 19:02:10,887] {standard_task_runner.py:52} INFO - Started process 140570 to run task
[2020-01-29 19:02:11,048] {logging_mixin.py:112} INFO - [2020-01-29 19:02:11,047] {dagbag.py:403} INFO - Filling up the DagBag from /home/airflow/airflow/dags/mydag.py
[2020-01-29 19:02:11,052] {logging_mixin.py:112} INFO - <output from my dag definition file>
[2020-01-29 19:02:11,101] {logging_mixin.py:112} INFO - <more output from my dag definition file>
....
....
....
[2020-01-29 19:02:58,651] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [running]> airflowetl.co.local
[2020-01-29 19:02:58,674] {bash_operator.py:81} INFO - Tmp dir root location: 
 /tmp
[2020-01-29 19:02:58,674] {bash_operator.py:91} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_EMAIL=me@co.org
AIRFLOW_CTX_DAG_OWNER=me
AIRFLOW_CTX_DAG_ID=mydag
AIRFLOW_CTX_TASK_ID=mytask
AIRFLOW_CTX_EXECUTION_DATE=2020-01-30T04:51:34.621883+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2020-01-30T04:51:34.621883+00:00
[2020-01-29 19:02:58,675] {bash_operator.py:105} INFO - Temporary script location: /tmp/airflowtmphwu1ckty/mytaskbmnsizw5
<only now does the actual task logic output seem to start>

The first part of the log seems to imply that the DAG file is executed each time a new task starts (I see this for every task).

Is this indeed what is happening here? Is this normal / expected behavior? Since my DAG takes some time to build, that build time would be multiplied across every task in the DAG (of which there are many in this case), which makes me think this is either not normal or there is some best practice I am missing. Could anyone with more Airflow experience help explain what I'm seeing here?
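To illustrate, here is a sketch of the kind of top-level work my DAG file does (names and timings are hypothetical, not my real code):

```python
import time

def expensive_config_lookup():
    """Stand-in for the slow DB queries my real DAG file runs at
    import time (hypothetical; the real build takes ~1-3 min)."""
    time.sleep(0.05)
    return {"tables": ["orders", "customers", "invoices"]}

# Because this call sits at module top level, it re-runs on EVERY
# parse of the file -- once per scheduler loop and, it seems from the
# logs, once per task.
config = expensive_config_lookup()
```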


1 Answer

After some discussion on the Airflow mailing list, it turns out that Airflow builds the DAG for each task when that task runs, so every task incurs the overhead of building the DAG again (which in my case was very significant). From the conversation:

Yes, every task is run in process isolation (and could be running across separate machines), so every task builds the DAG from scratch.

Fundamentally, tasks are defined by the code itself, and so a worker process can only determine what code to execute by parsing the Python code that defines it. Maybe there are some cases where a task can be well defined outside of the full context of the DAG that contains it, but that doesn't apply in general.

It’s not that the task cares about the DAG structure; it’s that tasks only exist as part of the DAG. The only way to get a task in Airflow is to get the DAG first, at least as far as I know.
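The re-parsing described above can be simulated outside Airflow. This is a sketch, not Airflow's actual loader: each fresh execution of the "DAG file" source pays the full top-level cost again, just as each isolated worker process does.

```python
import time
import types

# Hypothetical DAG-file source whose top level is slow (a stand-in
# for the real file's DB queries).
DAG_SRC = """
import time
time.sleep(0.02)  # simulated expensive top-level work
dag_id = "mydag"
"""

def parse_dag_file(src, name="mydag"):
    # Mimic a worker parsing the DAG file from scratch: compile and
    # execute the module source in a fresh namespace.
    mod = types.ModuleType(name)
    exec(compile(src, f"{name}.py", "exec"), mod.__dict__)
    return mod

# Each "task" pays the parse cost again:
start = time.monotonic()
for _ in range(3):  # three tasks -> three full parses
    mod = parse_dag_file(DAG_SRC)
elapsed = time.monotonic() - start  # roughly 3 x 0.02 s
```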

Ultimately, what I did was pickle the configs I use to build the graph, and make that config-building step its own scheduled DAG that then triggers my actual DAG (whose schedule is set to None). My DAG is created in a loop that runs DB queries to make DAG branches for a set of tables, so those queries were causing the overhead: they were being executed for every single task, when they really only needed to run once to produce the config dict for the DAG.
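A sketch of that split, using a JSON file as the hand-off (my actual setup pickles the config; the paths and helper names here are made up):

```python
import json
import os
import tempfile

# Hypothetical hand-off location between the two DAGs.
CONFIG_PATH = os.path.join(tempfile.gettempdir(), "mydag_config.json")

def build_config(tables):
    """Runs in the separate, scheduled config-builder DAG: do the
    expensive DB queries ONCE and persist the result."""
    config = {t: {"branch_task_id": f"load_{t}"} for t in tables}
    with open(CONFIG_PATH, "w") as f:
        json.dump(config, f)
    return config

def load_config():
    """Runs at the top of the real DAG file on every parse: a cheap
    file read instead of re-running the DB queries."""
    with open(CONFIG_PATH) as f:
        return json.load(f)

# The builder DAG populates the cache; the main DAG only reads it.
build_config(["orders", "customers"])
config = load_config()
```

With this split, re-parsing the main DAG file per task still happens, but each parse is now cheap.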
