
I know in the cfg I can set the parallelism, but is there a way to do it per task, or at least per dag?

dag1=

task_id: 'download_sftp'
parallelism: 4 #I am fine with downloading multiple files at once


task_id: 'process_dimensions'
parallelism: 1 #I want to make sure the dimensions are processed one at a time to prevent conflicts with my 'serial' keys

task_id: 'process_facts'
parallelism: 4 #It is fine to have multiple tables processed at once since there will be no conflicts

dag2 (separate file)=

task_id: 'bcp_query'
parallelism: 6 #I can query separate BCP commands to download data quickly since it is very small amounts of data
trench

2 Answers


You can create a pool in the web UI and limit parallelism by assigning specific tasks to that pool. Each pool has a fixed number of slots, so at most that many of its tasks run at the same time.

Please see: https://airflow.apache.org/concepts.html#pools
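To make the slot idea concrete: a task is assigned to a pool by name through the operator's `pool` argument, and it only starts once the pool has a free slot. The sketch below is not Airflow code. It models that behavior with a plain `threading.Semaphore` so the effect of the slot count can be observed. The pool names and slot counts are hypothetical, chosen to mirror the question.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pools mirroring the question: name -> number of slots.
POOL_SLOTS = {"sftp_pool": 4, "dimension_pool": 1}


def peak_concurrency(pool_name, n_tasks, workers=8):
    """Run n_tasks that all use one pool; return the most that ran at once."""
    slots = threading.Semaphore(POOL_SLOTS[pool_name])
    lock = threading.Lock()
    running = 0
    peak = 0

    def task():
        nonlocal running, peak
        with slots:  # wait for a free slot, like a queued task instance
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.02)  # stand-in for the real work
            with lock:
                running -= 1

    with ThreadPoolExecutor(max_workers=workers) as executor:
        for _ in range(n_tasks):
            executor.submit(task)
    return peak


if __name__ == "__main__":
    print(peak_concurrency("dimension_pool", 6))  # never more than 1
    print(peak_concurrency("sftp_pool", 6))       # never more than 4
```

With a one-slot pool the dimension loads are serialized even though eight workers are available, which is the behavior the question asks for.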

Necravolver
    Is there more I can read about pools? Do I just name a pool as a string and then Airflow magically handles everything? I haven't branched into this area before so I want to make sure I understand what's going on. The one thing I want to avoid is having two tasks try to update a dimension table or something and cause a conflict (I am using postgres and psycopg2 COPY EXPERT to load my data). So for the dimension table updates, I'd like each source to do it one at a time, but for SFTP downloads and fact table loading, I can have several process at once. – trench Apr 12 '17 at 12:21

The number of active DAG runs can be controlled with the parameter below, which is set in the airflow.cfg configuration file and applies globally. By default it is set to 16; changing it to 1 ensures that only one run of a DAG is active at a time and the rest are queued.

#The maximum number of active DAG runs per DAG

max_active_runs_per_dag = 16

See also the question "How to limit Airflow to run only 1 DAG run at a time?", which suggests how to control concurrency per DAG.
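For a per-DAG limit, the `DAG` constructor also accepts a `max_active_runs` argument, which takes precedence over the global setting for that DAG. The sketch below is a small model of that precedence rule, not Airflow's own code. It assumes the setting lives under the `[core]` section of airflow.cfg, and the helper name `effective_max_active_runs` is made up for illustration.

```python
import configparser

# Fragment of airflow.cfg as quoted in the answer, with the value lowered to 1.
# The [core] section name is an assumption about where the setting lives.
CFG_TEXT = """
[core]
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 1
"""


def effective_max_active_runs(cfg_text, dag_max_active_runs=None):
    """Return the run limit for a DAG: its own setting, else the global one."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    global_default = parser.getint("core", "max_active_runs_per_dag")
    if dag_max_active_runs is not None:
        return dag_max_active_runs  # value passed to the DAG itself wins
    return global_default


if __name__ == "__main__":
    print(effective_max_active_runs(CFG_TEXT))      # falls back to the cfg value
    print(effective_max_active_runs(CFG_TEXT, 3))   # per-DAG value overrides it
```

In a real DAG file the per-DAG value would be passed as `max_active_runs=1` when the `DAG` object is created, leaving the global default untouched for every other DAG.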

Sathish
    Note that, as explained in [this](https://stackoverflow.com/a/43179385/3679900) answer, `max_active_runs_per_dag` and `max_active_runs` are different. `max_active_runs_per_dag` is a global configuration setting, whereas `max_active_runs` is a parameter of the `DAG` class. In other words, if `max_active_runs` is not provided when instantiating a `DAG` object, it defaults to the `max_active_runs_per_dag` configuration setting of Airflow. – y2k-shubham Sep 04 '18 at 07:34