Does anyone use MWAA in production?

We currently have around 500 DAGs running, and we are seeing unexpected behavior: tasks stay in the "queued" state for no apparent reason.

Task is in the 'queued' state which is not a valid state for execution. The task must be cleared in order to be run.

It happens randomly: everything can run perfectly for a day, and then a few tasks will stay queued. Those tasks remain in this state forever unless we manually mark them as failed.

A DAG run can stay in this "queued" state even when the pool is empty; I can't see any reason that would explain this.

It happens to roughly 5% of the tasks, while all the others run smoothly.
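
For what it's worth, automating the "mark as failed" workaround could look roughly like the sketch below. This is an illustration only: the one-hour threshold, the DAG name, and the direct metadata-database access through the Airflow ORM are my own assumptions, not anything AWS recommends.

# Sketch of a maintenance DAG that fails tasks stuck in "queued" too long.
# Assumes Airflow 2.x and that workers can reach the metadata DB via the ORM.
from datetime import datetime, timedelta

from airflow import DAG, settings
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.utils import timezone
from airflow.utils.state import State

STUCK_THRESHOLD = timedelta(hours=1)  # assumption: tune to your workloads

def fail_stuck_queued_tasks():
    session = settings.Session()
    cutoff = timezone.utcnow() - STUCK_THRESHOLD
    stuck = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.state == State.QUEUED,
            TaskInstance.queued_dttm < cutoff,  # queued_dttm: when the task entered the queue
        )
        .all()
    )
    for ti in stuck:
        print(f"Failing stuck task {ti.dag_id}.{ti.task_id} @ {ti.execution_date}")
        ti.state = State.FAILED  # same effect as marking it failed in the UI
    session.commit()
    session.close()

with DAG(
    dag_id="fail_stuck_queued_tasks",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="*/30 * * * *",  # sweep every 30 minutes
    catchup=False,
) as dag:
    PythonOperator(task_id="sweep", python_callable=fail_stuck_queued_tasks)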

Has anyone else encountered this behavior?

val

1 Answer


This was happening to me in MWAA as well. The solution, recommended to me by AWS, was to add the following Airflow configuration options via the web UI:

celery.sync_parallelism = 1 
core.dag_file_processor_timeout = 150
core.dagbag_import_timeout = 90 
core.min_serialized_dag_update_interval = 300
scheduler.dag_dir_list_interval = 600 
scheduler.min_file_process_interval = 300
scheduler.parsing_processes = 2 
scheduler.processor_poll_interval = 60 
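
On MWAA, these same keys can also be applied programmatically instead of through the console. Here is a minimal sketch using boto3's update_environment; the environment name is hypothetical, and every value must be passed as a string:

# Sketch: apply the same Airflow configuration options with boto3.
import boto3

client = boto3.client("mwaa")
client.update_environment(
    Name="my-mwaa-environment",  # hypothetical environment name
    AirflowConfigurationOptions={
        "celery.sync_parallelism": "1",
        "core.dag_file_processor_timeout": "150",
        "core.dagbag_import_timeout": "90",
        "core.min_serialized_dag_update_interval": "300",
        "scheduler.dag_dir_list_interval": "600",
        "scheduler.min_file_process_interval": "300",
        "scheduler.parsing_processes": "2",
        "scheduler.processor_poll_interval": "60",
    },
)

Note that update_environment kicks off an environment update, which can take a while to roll out.
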
Kevin Vo