
Task parallelism in general means running multiple tasks on the same or different sets of data. But what does it mean in the context of Airflow, when I change the parallelism parameter in the airflow.cfg file?

For instance, say I want to run a data processor on a batch of data. Will setting parallelism to 32 split the data into 32 sub-batches and run the same task on those sub-batches?

Or, if I somehow have 32 batches of data to begin with instead of 1, would I be able to run the data processor on all 32 batches at once (i.e. 32 task runs at the same time)?

coderboi

1 Answer


The setting won't "split the data" within your DAG. From the docs:

parallelism: This variable controls the number of task instances that run simultaneously across the whole Airflow cluster
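
For illustration, the setting lives in the [core] section of airflow.cfg; the value below is just an example, not a recommendation:

    [core]
    # maximum number of task instances that may run at once
    # across the entire Airflow installation
    parallelism = 32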

If you want parallel execution of a task, you will need to break it up further, i.e. create more tasks where each task does less work. That can come in handy for some ETLs.
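
A minimal sketch of that idea (the DAG name, schedule and the process_batch callable are hypothetical, not part of your question):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def process_batch(batch_id, **kwargs):
        # stand-in for your real data processor
        print(f"processing batch {batch_id}")

    with DAG(
        dag_id="parallel_batches",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # 32 independent task instances; how many actually run at the
        # same time is capped by parallelism (and by other limits such
        # as pools and per-DAG concurrency settings)
        for i in range(32):
            PythonOperator(
                task_id=f"process_batch_{i}",
                python_callable=process_batch,
                op_kwargs={"batch_id": i},
            )

Since the tasks have no dependencies between them, the scheduler is free to run them concurrently up to the configured limits.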

For example:

Let's say you want to copy yesterday's records from MySQL to S3.

You could do it with a single MySQLToS3Operator that reads yesterday's data in a single query. However, you can also break it into 2 MySQLToS3Operator tasks, each reading 12 hours of data, or 24 operators, each reading one hour of data. That is up to you and the limitations of the services you are working with.
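
A sketch of the hourly variant (the table name, bucket and connection ids are made up, and the import path may differ between provider versions):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.transfers.mysql_to_s3 import MySQLToS3Operator

    with DAG(
        dag_id="mysql_to_s3_hourly_chunks",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # one operator per hour of yesterday's data; with a high enough
        # parallelism setting they can all run at the same time
        for hour in range(24):
            MySQLToS3Operator(
                task_id=f"copy_hour_{hour:02d}",
                query=(
                    f"SELECT * FROM my_table "
                    f"WHERE DATE(created_at) = '{{{{ macros.ds_add(ds, -1) }}}}' "
                    f"AND HOUR(created_at) = {hour}"
                ),
                s3_bucket="my-bucket",
                s3_key=f"my_table/{{{{ ds }}}}/hour={hour:02d}.csv",
                mysql_conn_id="mysql_default",
                aws_conn_id="aws_default",
            )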

Elad Kalif