Airflow Version 2.0.2
I have three schedulers running in a Kubernetes cluster, using the CeleryExecutor with a Postgres backend. Everything seems to run fine for a couple of weeks, but then the Airflow scheduler stops scheduling some tasks. I've run airflow db reset followed by airflow db init and redeployed fresh Airflow-specific images. Below are some of the errors from the database logs:
According to https://github.com/apache/airflow/issues/19811, the slot_pool issue is expected behavior, but I cannot figure out why DAGs suddenly stop being scheduled on time. For reference, there are ~500 DAGs being run every 15 minutes.
LOG: could not receive data from client: Connection timed out
STATEMENT: SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
FROM slot_pool FOR UPDATE NOWAIT
The slot_pool table looks like this:
select * from slot_pool;
 id |     pool     | slots | description
----+--------------+-------+--------------
  1 | default_pool |   128 | Default pool
(1 row)
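As a rough sanity check on whether the 128 slots in default_pool could be a bottleneck for ~500 DAG runs every 15 minutes, here is a back-of-the-envelope sketch. It is only an estimate under stated assumptions: one task per DAG run and the average task durations in the loop are hypothetical values, not measurements from this deployment, and the function names are made up for illustration.

```python
import math


def required_slots(dag_runs, interval_seconds, tasks_per_dag_run, avg_task_seconds):
    """Average number of concurrently running tasks (arrival rate * duration)."""
    # Multiply before dividing to limit floating-point rounding surprises.
    return math.ceil(dag_runs * tasks_per_dag_run * avg_task_seconds / interval_seconds)


def drain_seconds(total_tasks, pool_slots, avg_task_seconds):
    """Time to work through a burst if all tasks become schedulable at once."""
    waves = math.ceil(total_tasks / pool_slots)
    return waves * avg_task_seconds


POOL_SLOTS = 128       # from the slot_pool table above
DAG_RUNS = 500         # approximate figure from the question
INTERVAL = 15 * 60     # 15 minutes, in seconds
TASKS_PER_RUN = 1      # ASSUMPTION: one task per DAG run

# ASSUMPTION: hypothetical average task durations, in seconds.
for avg_task_seconds in (30, 120, 300):
    needed = required_slots(DAG_RUNS, INTERVAL, TASKS_PER_RUN, avg_task_seconds)
    drain = drain_seconds(DAG_RUNS * TASKS_PER_RUN, POOL_SLOTS, avg_task_seconds)
    print(
        f"avg task {avg_task_seconds}s: ~{needed} slots needed on average, "
        f"burst drains in ~{drain}s, "
        f"fits in interval: {drain <= INTERVAL and needed <= POOL_SLOTS}"
    )
```

If the real task count per DAG or the real task duration is higher than assumed here, the pool could saturate and tasks would wait in the queued state, which might look like DAGs not being scheduled on time. This does not explain the connection timeout in the log by itself.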
I have looked at several posts, but none of them explains the issue or provides a solution. Here are a few:
Airflow initdb slot_pool does not exists
Running multiple Airflow Schedulers cause Postgres locking issues
Airflow tasks get stuck at "queued" status and never gets running