I kind-of "inherited" a project that uses Airflow 2.2.4 installed on a cluster of several nodes (meaning that I wasn't part of the deployment decisions and configurations and I might not be aware of some under-the-hood processes). Each node runs a scheduler, a CeleryExecutor and a webserver. Task logging is done locally on the nodes' file system. However there must be some misconfiguration somewhere and I can't figure it out. Here is what I have observed:
- a task is executed on node A; `1.log` is written in the log folder on node A, and the log is visible in the web UI (so far so good)
- the task fails, the retry mechanism kicks in, and the task is re-executed on node B; `2.log` is written in the log folder on node B, and this latest log is visible in the UI
- however, at this point the UI fails to display `1.log`: the problem is that it tries to fetch it from node B rather than node A (I checked that `1.log` does still exist on node A); my current reading of why is sketched right after this list
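From skimming the 2.2.4 source (`airflow/utils/log/file_task_handler.py`), my understanding is that when the webserver doesn't find the log file locally, it builds the fetch URL from the hostname stored on the task instance in the metadata DB, and that column only holds the host of the latest try. This is a simplified paraphrase (the function name `fetch_remote_log` is mine, and I've left out the auth headers), not the actual code:

```python
# Simplified paraphrase of what FileTaskHandler._read() does in Airflow 2.2.4
# when the requested log file is not on the webserver's local disk.
import os
import requests
from airflow.configuration import conf

def fetch_remote_log(ti, log_relative_path):
    # ti.hostname is overwritten on every try, so after a retry on node B
    # the URL for 1.log points at node B even though try 1 ran on node A.
    url = os.path.join(
        "http://{ti.hostname}:{worker_log_server_port}/log", log_relative_path
    ).format(
        ti=ti,
        worker_log_server_port=conf.get("logging", "worker_log_server_port"),
    )
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.text
```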
Example of UI error message:
```
*** Log file does not exist: [install_path]/airflow/logs/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/1.log
*** Fetching from: http://nodeb.mycompany.com:19793/log/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/1.log
*** Failed to fetch log file from worker. Client error '404 NOT FOUND' for url 'http://nodeb.mycompany.com:19793/log/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/1.log'
For more information check: https://httpstatuses.com/404
```
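One way I could double-check this hostname theory is to look at what the metadata DB has recorded for the failing task; a minimal sketch, using the masked dag/task ids from above (I haven't run this exact snippet, it's just the check I have in mind):

```python
# Sketch: print the hostname the metadata DB currently holds for each
# task instance of the failing task. After the retry on node B, I would
# expect it to show node B even though try 1 ran (and logged) on node A.
from airflow.models import TaskInstance
from airflow.utils.session import create_session

with create_session() as session:
    tis = (
        session.query(TaskInstance)
        .filter_by(dag_id="start_acquisition", task_id="run_writegofile")
        .all()
    )
    for ti in tis:
        print(ti.run_id, ti.try_number, ti.hostname)
```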
Example of a correct log-fetching message:

```
*** Log file does not exist: [install_path]/airflow/logs/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/2.log
*** Fetching from: http://nodeb.mycompany.com:19793/log/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/2.log
```
Sorry, I had to mask out some sensitive info. I'm more than happy to provide more details about the configuration or anything else; I'm just not sure what would be useful here.