
Airflow Version 2.0.2

I have three schedulers running in a Kubernetes cluster, using the CeleryExecutor with a Postgres backend. Everything runs fine for a couple of weeks, but then the Airflow scheduler stops scheduling some tasks. I've done an airflow db reset followed by an airflow db init and a fresh deployment of the Airflow-specific images. Below are some of the errors from the database logs:

According to https://github.com/apache/airflow/issues/19811 the slot_pool issue is expected behavior, but I cannot figure out why DAGs suddenly stop being scheduled on time. For reference, there are ~500 DAGs being run every 15 minutes.

LOG: could not receive data from client: Connection timed out

STATEMENT:  SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
FROM slot_pool FOR UPDATE NOWAIT
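
To make my understanding of the NOWAIT part concrete: with FOR UPDATE NOWAIT, a session that cannot get the row lock immediately gets an error back instead of waiting, so with several schedulers some failed lock attempts are to be expected. Below is a rough analogy in plain Python, not Airflow's actual code. The lock object, the function, and the scheduler names are all made up for illustration, and a threading.Lock only stands in for the row lock on slot_pool.

```python
import threading

# Stand-in for the row lock on slot_pool (an analogy, not Airflow code).
slot_pool_lock = threading.Lock()


def try_schedule(name, results):
    """Try to take the lock without waiting, similar in spirit to NOWAIT."""
    # acquire(blocking=False) returns immediately: True if the lock was
    # obtained, False if someone else already holds it.
    if slot_pool_lock.acquire(blocking=False):
        # The lock is deliberately kept held here to mimic a transaction
        # that is still open while the other schedulers make their attempt.
        results[name] = "locked"
    else:
        # With NOWAIT the database reports an error instead of blocking;
        # here the attempt is simply recorded as skipped.
        results[name] = "skipped"


results = {}
for name in ("scheduler-1", "scheduler-2", "scheduler-3"):
    try_schedule(name, results)

# End of the simulated transaction: release the lock held by scheduler-1.
slot_pool_lock.release()
print(results)
```

In this sketch only the first caller gets the lock and the other two back off right away, which is how I read the "expected behavior" explanation. It does not explain why scheduling falls behind after a couple of weeks.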

The slot_pool table looks like this:

select * from slot_pool;
 id |     pool     | slots | description
----+--------------+-------+--------------
  1 | default_pool |   128 | Default pool
(1 row)

I have looked at several posts, but none of them explains the issue or provides a solution. Below are a few of them:

Airflow initdb slot_pool does not exists

Running multiple Airflow Schedulers cause Postgres locking issues

Airflow tasks get stuck at "queued" status and never gets running

sempervent