Question-1
What's the recommendation in this scenario?
Always try to pack as much work as possible into a single task (and preferably have fat tasks that run for several minutes rather than lean tasks that run for a few seconds), chiefly to keep the scheduler's workload low (see Question-2 below); the reasons are not exhaustive.
Exceptions to this rule (not an exhaustive list) could be:

1. when idempotency dictates that you must break your tasks into smaller steps to prevent wasteful re-computation / breaking of the workflow, as stated in the Using Operators doc:

   > An operator represents a single, ideally idempotent, task

2. peculiar cases, such as using `pool`s to limit the load on an external resource: each operation that touches that external resource has to be modelled as a separate task so that the load restriction can be enforced via the pool (see the sketch after this list)
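To illustrate exception 2, here is a minimal sketch (all DAG/pool/function names are hypothetical, assuming Airflow 2.4+): each call to the rate-limited resource is modelled as its own task, and all of them are assigned to the same pool, so Airflow never runs more of them concurrently than the pool has slots. The pool itself must be created beforehand, e.g. via the UI or `airflow pools set`.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def call_external_api(endpoint: str) -> None:
    # stand-in for the real call to the rate-limited resource
    print(f"hitting {endpoint}")


with DAG(
    dag_id="pool_limited_example",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    for endpoint in ("/users", "/orders", "/invoices"):
        # each touch of the external resource is a separate task, so the
        # pool's slot count caps how many run at the same time
        PythonOperator(
            task_id=f"fetch{endpoint.replace('/', '_')}",
            python_callable=call_external_api,
            op_args=[endpoint],
            pool="external_api",  # hypothetical pool, created ahead of time
        )
```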
Question-2
What's the overhead of an Airflow task, assuming there are enough resources?
While I can't provide a technically precise answer here, do understand that Airflow's scheduler essentially works on a poll-based approach:

- at every heartbeat (usually ~20-30 s), it scans the meta-db and the `DagBag` to determine the list of tasks that are ready to run, e.g.
  - a `scheduled` task whose upstream tasks have all run
  - an `up_for_retry` task whose `retry_delay` has expired (see the sketch below)
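For context on that second case, `retries` and `retry_delay` are ordinary operator arguments; a minimal sketch (hypothetical DAG, assuming Airflow 2.x) of a task that will sit in `up_for_retry` for five minutes between attempts:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_example",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # on failure this task enters up_for_retry; the scheduler will not
    # queue it again until retry_delay has elapsed
    BashOperator(
        task_id="flaky_step",
        bash_command="exit 1",       # always fails, just to exercise retries
        retries=3,
        retry_delay=timedelta(minutes=5),
    )
```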
From the old docs:

> The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
- this means that having more tasks (as well as more connections / dependencies between them) increases the scheduler's workload, since there are more checks to evaluate on every pass (see the sketch below)
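Tying this back to Question-1, here is a minimal sketch (hypothetical names, assuming Airflow 2.4+) of the "fat task" approach: three logical steps collapsed into a single task instance, leaving the scheduler one task and zero inter-task dependencies to evaluate, instead of three tasks plus two dependency edges.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    return [1, 2, 3]                    # placeholder extract step


def transform(rows):
    return [r * 10 for r in rows]       # placeholder transform step


def load(rows):
    print(f"loaded {len(rows)} rows")   # placeholder load step


def etl():
    # all three steps run inside one task instance, so the scheduler has a
    # single task to check per heartbeat rather than a chain of three
    load(transform(extract()))


with DAG(
    dag_id="fat_task_example",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="etl", python_callable=etl)
```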
Suggested reads

> For all these issues with running a massive number of fast/small tasks, we require fast distributed task management, that does not require previous resource allocation (as Airflow does), as each ETL task needs very few resources, and allows tasks to be executed one after the other immediately.