
Having a problem where I am unable to turn on a DAG in the Airflow webserver UI.

One thing to note is that the DAG in question was originally causing timeout errors when trying to run, so I edited the airflow.cfg file to include the line...

dagbag_import_timeout = 300
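(For context, that line sits under the [core] section of airflow.cfg; a minimal excerpt, with the comment paraphrased in my own words:)

[core]
# how long (in seconds) a single DAG file import may take while filling the DagBag
dagbag_import_timeout = 300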

Now, after making this change, running...

airflow list_dags

I can see the DAG gets built successfully.

Then, going to the webserver, I refresh the DAG in the UI, switch the DAG status to "On", and click on the DAG to attempt to see the graph view.

I either get a message about timing out like...

Broken DAG: [/home/airflow/airflow/dags/mydag.py] Timeout, PID: 44818

(despite the DAG appearing to build successfully during the airflow list_dags command), or the webserver page shows a browser error like "page sent no data", and after reloading I see that the DAG has been switched off (in either case, there is no indication of a problem in airflow-webserver.log). I even notice that other DAGs that would normally run pretty fast are running much slower now.

Since the DAG appears to build when manually running airflow list_dags but not in the webserver, I think I may need to change one of the webserver timeout configs, e.g.

# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = ...
# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = ...
...
log_fetch_timeout_sec = ...
...

but I am not experienced enough with the underlying mechanisms of Airflow to determine how these may be connected.
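(For reference, a quick way to surface every timeout-related setting is to just grep the config; the path below is from my install:)

grep -n timeout /home/airflow/airflow/airflow.cfg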

More debugging info if it helps:

[root@airflowetl airflow]# ps -aux | grep webserver
airflow   16740  0.8  0.2 782620 134804 ?       S    15:17   0:06 [ready] gunicorn: worker [airflow-webserver]
airflow   29758  2.3  0.2 756164 108644 ?       S    15:26   0:03 [ready] gunicorn: worker [airflow-webserver]
airflow   33820 14.8  0.1 724788 78036 ?        S    15:29   0:01 gunicorn: worker [airflow-webserver]
airflow   33854 26.7  0.1 724784 78032 ?        S    15:29   0:01 gunicorn: worker [airflow-webserver]
airflow   33855 26.5  0.1 724816 78064 ?        S    15:29   0:01 gunicorn: worker [airflow-webserver]
root      34072  0.0  0.0 112712   968 pts/0    S+   15:29   0:00 grep --color=auto webserver
airflow   91174  1.6  0.1 735708 82468 ?        S    14:14   1:14 /usr/bin/python3 /home/airflow/.local/bin/airflow webserver -D
airflow   91211  0.0  0.1 355040 53472 ?        S    14:14   0:01 gunicorn: master [airflow-webserver]

Does anyone with more Airflow experience have any ideas why this could be happening and how to fix it (maybe some airflow.cfg timeout config that I should extend)?


Update:

After further debugging, the problem appears to be with how a particular task is configured/created in the DAG. The DAG definition itself is not very straightforward and is very application-specific, so I need to distill it into something more sensible and readable before posting. Though this still does not explain why the DAG appears to build during airflow list_dags, but not in the webserver.
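To give a rough sense of the pattern I suspect in the meantime, here is a stripped-down, hypothetical sketch (none of these names are from the real DAG): the task creation does expensive work at parse time, so the file is slow to import every time the DagBag is filled.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def expensive_lookup():
    # stand-in for e.g. a database/REST call that can take minutes
    return ["table_a", "table_b"]


def process_table(table_name):
    print("processing {}".format(table_name))


# problem pattern: this runs at import time, i.e. on every DagBag fill
tables = expensive_lookup()

dag = DAG(
    dag_id="mydag_sketch",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

for table in tables:
    PythonOperator(
        task_id="process_{}".format(table),
        python_callable=process_table,
        op_args=[table],
        dag=dag,
    )

If that pattern is the culprit, then every process that imports the file (the scheduler, each gunicorn webserver worker, airflow list_dags) pays that parse cost.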

Going with what I can measure, timing the airflow list_dags command (just running it with the time utility) with and without the one change, the time difference is...

before change: real 1m31.201s
after change: real  2m39.744s
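(The measurement itself was nothing fancy, just wrapping the parse command:)

time airflow list_dags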

Update:

After more debugging, I suspect the issue is ultimately with the webserver. I am always able to build the DAG when running airflow list_dags, but when other DAGs are running, I cannot click on the DAG in the webserver without timeout errors being thrown. When no other DAGs are running, I am able to view the DAG (tree and graph views) in the webserver, but upon going back to the main screen, I see the same "Broken DAG: ... Timeout, PID: 1234" error as before.

lampShadesDrifter
  • Is it dag-specific? Then post `mydag.py` here. If it is not dag-specific (i.e. the problem is reproduced with a stubbed dag), then post the airflow scheduler logs. – Sergey Romanovsky Jan 16 '20 at 01:48
  • I asked this question on the Airflow project mailing list and am still getting some useful info on the issue. Others interested can follow the thread here: https://lists.apache.org/thread.html/r19014dbaec4042f4f15973644d194137429c93c6e23bea3314230c41%40%3Cusers.airflow.apache.org%3E – lampShadesDrifter Jan 20 '20 at 19:45

3 Answers


Please also share your DAG definition if possible.
Off the top of my head, I think you haven't started the Airflow scheduler by running

$ airflow scheduler

along with the webserver (in a separate terminal, if you have not daemonized it).
The Airflow scheduler as well as the webserver have to be running simultaneously for you to run DAGs and see the progress on the webserver.
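For example, to daemonize both (the -D flag runs them in the background, as the question already does for the webserver):

$ airflow scheduler -D
$ airflow webserver -D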

saadi
  • The scheduler was running. I can confirm, since I was checking the airflow-scheduler.log file while trying to debug and it was getting filled in as another DAG was running at the time. I strongly suspect it is an issue with the webserver itself (and the fact that the DAG takes some time to build); please see my updated post, and if you have any ideas (or disagree with my suspicion), please let me know. – lampShadesDrifter Jan 16 '20 at 20:46

Timing the airflow list_dags command (just running it with the time utility) with and without the one change, the time difference is...

before change: real 1m31.201s
after change: real  2m39.744s

Looking at the airflow.cfg file for anything related to the webserver and timeouts, especially any timeout value shorter than the new DAG build time (2m39s), I found...

# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 120

# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 120
...
# The amount of time (in secs) webserver will wait for initial handshake
# while fetching logs from other worker machine
log_fetch_timeout_sec = 5

Running the webserver (not as a daemon process) and just watching the output, when it attempts to fill the DagBag with the newly modified DAG, I see errors on the gunicorn workers like...

[2020-01-16 11:12:22 -1000] [137034] [CRITICAL] WORKER TIMEOUT (pid:137039)
[2020-01-16 11:12:22 -1000] [137034] [CRITICAL] WORKER TIMEOUT (pid:137040)
[2020-01-16 11:12:22 -1000] [137034] [CRITICAL] WORKER TIMEOUT (pid:137041)
[2020-01-16 11:12:22 -1000] [137034] [CRITICAL] WORKER TIMEOUT (pid:137042)
[2020-01-16 11:12:22 -1000] [137039] [INFO] Worker exiting (pid: 137039)
[2020-01-16 11:12:22 -1000] [137041] [INFO] Worker exiting (pid: 137041)
[2020-01-16 11:12:22 -1000] [137042] [INFO] Worker exiting (pid: 137042)
[2020-01-16 11:12:22 -1000] [137040] [INFO] Worker exiting (pid: 137040)

After changing web_server_worker_timeout from 120 to 300 (5 min.) and testing access to the modified problem DAG (tree and graph views) in the webserver, it seems much faster, and (aside from a timeout error that continued to pop up in the webserver right after it initially started) the problem appears to be fixed.
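In airflow.cfg terms, the change was just (under the [webserver] section):

[webserver]
# raised from 120 so the gunicorn workers survive the slow DagBag fill
web_server_worker_timeout = 300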


Note that I am still seeing timeout errors pop up in the webserver (sometimes when refreshing the main page), though I can't find them in the webserver logs. I am not totally sure what the underlying mechanics of this problem are and would appreciate any further explanation in the comments, but the problem does appear to be fixed at this point. I also notice that all DAGs seem to run slower after the problem DAG was modified, i.e. it appears that more time passes before the next tasks are run (and changing it back so it takes less DagBag fill time seems to fix this). Not sure what to make of these problems at the moment, so any suggestions would be appreciated.


UPDATE:

After restarting the server, the DAG runs totally fine so long as no other DAG is running at the same time. I don't know the possible underlying reasons for this (since no other DAGs seem to have had this problem).

lampShadesDrifter
  • Ultimately it appears that the DAG build and slow webserver problems were related to the problems addressed here: https://stackoverflow.com/a/59995882/8236733 – lampShadesDrifter Feb 03 '20 at 10:07

I had a similar issue, and in my case it was that dependent services weren't running. My task depended on MongoDB and Redis, which were not running, so I just started them and had no more errors after that.
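For example, on a systemd-based host something like the commands below (the unit names mongod and redis are just examples; use whatever your MongoDB and Redis installs are registered as):

# check whether the backing services are up, and start them if not
sudo systemctl status mongod redis
sudo systemctl start mongod redis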

Shakeel