
I'm running the standard Airflow docker-compose file locally with all the defaults, using airflow:2.1.4, postgres:13, and redis:latest. Everything works as expected with one scheduler instance, but as soon as I add a second scheduler instance, I start getting locking errors in the Postgres log:

postgres_1 | STATEMENT:  SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots FROM slot_pool FOR UPDATE NOWAIT

The relevant part of my docker-compose file is:

x-airflow-common:
  &airflow-common
  environment:
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT: 360
    AIRFLOW__WEBSERVER__EXPOSE_CONFIG: 'true'
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__CORE__STORE_DAG_CODE: 'false'

services:
  # Two schedulers sharing the same common settings and the same database
  airflow-scheduler-1:
    <<: *airflow-common
    command: scheduler
    container_name: airflow-scheduler-1

  airflow-scheduler-2:
    <<: *airflow-common
    command: scheduler
    container_name: airflow-scheduler-2

The documentation doesn't help here: it says I can just run "airflow scheduler" multiple times and it should work out of the box. Is there some sort of HA setting that I'm missing?

Bran
  • It should work out of the box. The documentation is clear about it. I think that if you have an issue with locking, you should raise a GitHub issue about it and include all relevant details: your configuration setup, the errors you get in the logs, the logs from Postgres, etc. – Jarek Potiuk Oct 05 '21 at 09:22
  • @JarekPotiuk Our team concluded that the "error" we get from failing to acquire the lock in Postgres is an expected error, since multiple schedulers will try to update the lock, but n-1 will always fail. Strangely, the "error" occurs more often on Postgres 12/13/14 than on 11. – Bran Oct 28 '21 at 10:10
  • If you only see the locking errors in Postgres, this is expected. But we also have a change coming for that in 2.3.0 (https://github.com/apache/airflow/pull/19842) so that the logs will not be littered with those "false negatives". – Jarek Potiuk Dec 15 '21 at 14:03

1 Answer

The HA that comes out of the box from Airflow on the scheduler side is to run multiple schedulers on different machines.

As you have noticed, this does create lock contention on the database, because every scheduler tries to lock the same rows (the slot_pool table in your log).
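To make the locking behaviour concrete, here is a toy model in Python. It is only a sketch, not Airflow's code: a threading.Lock stands in for the Postgres row lock, a non-blocking acquire stands in for FOR UPDATE NOWAIT, and the names simulate and scheduler_loop are made up for illustration. It shows why, when several schedulers ask for the same lock at the same moment, one gets it and the rest fail immediately instead of waiting.

```python
import threading


def simulate(n_schedulers):
    """Let n hypothetical schedulers request the same lock at once."""
    lock = threading.Lock()                    # stand-in for the slot_pool row lock
    start = threading.Barrier(n_schedulers)    # line everyone up before trying
    done = threading.Barrier(n_schedulers)     # winner holds until all have tried
    outcomes = [None] * n_schedulers

    def scheduler_loop(idx):
        start.wait()
        # Non-blocking acquire: fail right away instead of queueing,
        # which is the behaviour NOWAIT asks the database for.
        acquired = lock.acquire(blocking=False)
        outcomes[idx] = "locked" if acquired else "lock not available"
        done.wait()
        if acquired:
            lock.release()

    threads = [
        threading.Thread(target=scheduler_loop, args=(i,))
        for i in range(n_schedulers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


if __name__ == "__main__":
    print(simulate(2))
```

With two schedulers, one entry comes back as "locked" and the other as "lock not available". In this model the failed attempt is harmless: the loser simply tries again on its next loop, which matches the comments above saying that the errors in the Postgres log are expected.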

Your options are limited at this point, and this becomes an optimization exercise across two broad categories:

  1. Change your DB configuration.

     Sizing the DB is usually based on the following:

    • number of files that regularly get parsed
    • number of schedulers
    • number of workers
    • number of tasks
    • frequency of execution
  2. Create multiple deployments per environment.

     This usually means creating different clusters per environment, with each cluster focusing on a particular workload, so that each one has fewer files to parse, fewer schedulers and workers, and fewer tasks.

SunnyAk