I am having a problem where I am unable to turn on a DAG in the Airflow webserver UI.
One thing to note is that the DAG in question was originally causing timeout errors when trying to run, so I edited the airflow.cfg file to have the line...
dagbag_import_timeout = 300
Now, after making this change, running...
airflow list_dags
I can see that the DAG gets built successfully.
Then I go to the webserver, refresh the DAG in the UI, switch the DAG status to "On", and click on the DAG to try to see the graph view.
I either get a message about timing out like...
Broken DAG: [/home/airflow/airflow/dags/mydag.py] Timeout, PID: 44818
(despite the DAG appearing to build successfully during the airflow list_dags command), or the webserver page shows a browser error like "page sent no data" and, after reloading, I see that the DAG has been switched off. In either case, there is no indication of a problem in airflow-webserver.log. I have even noticed that other DAGs that would normally run pretty fast are running much slower now.
Because the DAG appears to build when manually running airflow list_dags but not in the webserver, I think that perhaps I need to change one of the webserver timeout configs, e.g.
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = ...
# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = ...
...
log_fetch_timeout_sec = ...
...
but I am not experienced enough with the underlying mechanisms of Airflow to determine how these may be connected.
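To see the current values of these timeouts side by side, something like the following sketch may help. It only reads airflow.cfg with the standard library and does not import Airflow. The function name read_timeouts is made up for illustration, and the section names ([core] and [webserver]) are an assumption based on a stock Airflow 1.x config file, so they should be checked against the actual file.

```python
import configparser

# (section, option) pairs for the timeouts mentioned above.
# ASSUMPTION: the section names match a stock Airflow 1.x airflow.cfg;
# verify them against your own file.
TIMEOUT_OPTIONS = [
    ("core", "dagbag_import_timeout"),
    ("webserver", "web_server_master_timeout"),
    ("webserver", "web_server_worker_timeout"),
    ("webserver", "log_fetch_timeout_sec"),
]


def read_timeouts(cfg_path):
    """Return {"section.option": value or None} for the options listed above.

    Hypothetical helper, not part of Airflow. Values come back as strings;
    an option that is missing from the file maps to None.
    """
    # Interpolation is disabled so that '%' characters elsewhere in the
    # file (for example in log format strings) are left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)
    return {
        f"{section}.{option}": parser.get(section, option, fallback=None)
        for section, option in TIMEOUT_OPTIONS
    }
```

Calling read_timeouts with the path to airflow.cfg and printing the result shows which of these options are set and to what, which at least makes it easier to compare them against the 300 second dagbag_import_timeout.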
More debugging info if it helps:
[root@airflowetl airflow]# ps -aux | grep webserver
airflow 16740 0.8 0.2 782620 134804 ? S 15:17 0:06 [ready] gunicorn: worker [airflow-webserver]
airflow 29758 2.3 0.2 756164 108644 ? S 15:26 0:03 [ready] gunicorn: worker [airflow-webserver]
airflow 33820 14.8 0.1 724788 78036 ? S 15:29 0:01 gunicorn: worker [airflow-webserver]
airflow 33854 26.7 0.1 724784 78032 ? S 15:29 0:01 gunicorn: worker [airflow-webserver]
airflow 33855 26.5 0.1 724816 78064 ? S 15:29 0:01 gunicorn: worker [airflow-webserver]
root 34072 0.0 0.0 112712 968 pts/0 S+ 15:29 0:00 grep --color=auto webserver
airflow 91174 1.6 0.1 735708 82468 ? S 14:14 1:14 /usr/bin/python3 /home/airflow/.local/bin/airflow webserver -D
airflow 91211 0.0 0.1 355040 53472 ? S 14:14 0:01 gunicorn: master [airflow-webserver]
Does anyone with more Airflow experience have any ideas why this could be happening and how to fix it? (Maybe there is some airflow.cfg timeout config that I should extend?)
Update:
After further debugging, the problem appears to be with how a particular task is configured/created in the DAG. The DAG definition itself is not very straightforward and is very application specific, so I need to pare it down into something sensible and readable before posting. This still does not explain, though, why the DAG appears to build during airflow list_dags but not in the webserver.
Going with what I can measure, timing the airflow list_dags command (just running it with the time utility) with and without the one change, the time difference is...
before change: real 1m31.201s
after change: real 2m39.744s
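The numbers above cover the whole airflow list_dags run. To time a single DAG file on its own, a rough sketch like the one below could be used. The helper name time_dag_file is made up for illustration, and it simply executes the file with the standard library, which is an assumption about what is worth measuring and not necessarily how Airflow itself loads the file.

```python
import runpy
import time


def time_dag_file(path):
    """Execute the Python file at `path` once and return the elapsed
    wall-clock time in seconds.

    Hypothetical helper, not part of Airflow. It must be run in an
    environment where everything the DAG file imports is available.
    """
    start = time.perf_counter()
    runpy.run_path(path)  # runs the file's top-level code, like a fresh import
    return time.perf_counter() - start


# Example usage (path taken from the error message above):
# print(time_dag_file("/home/airflow/airflow/dags/mydag.py"))
```

Since this measures wall-clock time, running it once while other DAGs are executing and once while the machine is idle should show how much the parse time depends on load.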
Update:
After more debugging, I suspect the issue is ultimately with the webserver. I am always able to build the DAG when running airflow list_dags, but when other DAGs are running, I am not able to click on the DAG in the webserver without timeout errors being thrown. When no other DAGs are running, I am able to view the DAG (tree and graph) in the webserver, but upon going back to the main screen, I see the same "Broken DAG: ... Timeout, PID: 1234" error as before.