Question-1
What's the recommendation in this scenario?
Always try to pack as much work as possible into a single task (and preferably have fat tasks that run for several minutes rather than lean tasks that run for a few seconds), chiefly to keep the scheduler's workload low (see Question-2 below); the reasons are not exhaustive.
Exceptions to this rule (not an exhaustive list) could be:

1. when idempotency dictates that you must break your tasks into smaller steps to prevent wasteful re-computation / breaking of the workflow, as stated in the Using Operators doc:

   > An operator represents a single, ideally idempotent, task

2. peculiar cases, such as using `pool`s to limit the load on an external resource: each operation that touches that external resource has to be modelled as a separate task so that the load restriction can be enforced via the pool (see the sketch after this list)
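To illustrate exception 2, here is a minimal sketch (all DAG/pool/function names are hypothetical, assuming Airflow 2.4+): each call to the rate-limited resource is modelled as its own task, and all of them are assigned to the same pool, so Airflow never runs more of them concurrently than the pool has slots. The pool itself must be created beforehand, e.g. via the UI or `airflow pools set`.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def call_external_api(endpoint: str) -> None:
    # stand-in for the real call to the rate-limited resource
    print(f"hitting {endpoint}")


with DAG(
    dag_id="pool_limited_example",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    for endpoint in ("/users", "/orders", "/invoices"):
        # each touch of the external resource is a separate task, so the
        # pool's slot count caps how many run at the same time
        PythonOperator(
            task_id=f"fetch{endpoint.replace('/', '_')}",
            python_callable=call_external_api,
            op_args=[endpoint],
            pool="external_api",  # hypothetical pool, created ahead of time
        )
```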
Question-2
What's the overhead of an Airflow task, assuming there are enough resources?
While I can't provide a technically precise answer here, do understand that Airflow's scheduler essentially works on a poll-based approach:

- at every heartbeat (usually ~20-30 s), it scans the meta-db and the `DagBag` to determine the list of tasks that are ready to run, e.g.
  - a `scheduled` task whose upstream tasks have all run
  - an `up_for_retry` task whose `retry_delay` has expired (see the sketch below)
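For context on that second case, `retries` and `retry_delay` are ordinary operator arguments; a minimal sketch (hypothetical DAG, assuming Airflow 2.x) of a task that will sit in `up_for_retry` for five minutes between attempts:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_example",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # on failure this task enters up_for_retry; the scheduler will not
    # queue it again until retry_delay has elapsed
    BashOperator(
        task_id="flaky_step",
        bash_command="exit 1",       # always fails, just to exercise retries
        retries=3,
        retry_delay=timedelta(minutes=5),
    )
```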
From the old docs:

> The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
- this means that having more tasks (as well as more connections / dependencies between them) increases the scheduler's workload, since there are more checks to evaluate on every pass (see the sketch below)
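Tying this back to Question-1, here is a minimal sketch (hypothetical names, assuming Airflow 2.4+) of the "fat task" approach: three logical steps collapsed into a single task instance, leaving the scheduler one task and zero inter-task dependencies to evaluate, instead of three tasks plus two dependency edges.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    return [1, 2, 3]                    # placeholder extract step


def transform(rows):
    return [r * 10 for r in rows]       # placeholder transform step


def load(rows):
    print(f"loaded {len(rows)} rows")   # placeholder load step


def etl():
    # all three steps run inside one task instance, so the scheduler has a
    # single task to check per heartbeat rather than a chain of three
    load(transform(extract()))


with DAG(
    dag_id="fat_task_example",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="etl", python_callable=etl)
```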
Suggested reads

> For all these issues with running a massive number of fast/small tasks, we require fast distributed task management, that does not require previous resource allocation (as Airflow does), as each ETL task needs very few resources, and allows tasks to be executed one after the other immediately.