
I am running an Airflow process with 400+ tasks on an early 2015 MacBook Pro with a 3.1 GHz Intel Core i7 processor and 16 GB of RAM.

The script I'm running looks pretty much like this, with the difference that my DAG is defined with

from datetime import datetime

default_args = {
    'start_date': datetime.now(),
    'max_active_runs': 2
}

to try to avoid firing too many tasks in parallel (a simplified sketch of the full wiring is included after my questions below). What follows is a series of screenshots of my experience doing this. My questions here are:

  1. This operation generates an enormous number of Python processes. Is it necessary to define the entire task queue in RAM this way, or can Airflow take a "generate tasks as we go" approach that avoids firing up so many processes?
  2. I would have thought that max_active_runs controls how many processes are actually doing work at any given time. Reviewing my tasks, though, I see dozens of tasks occupying CPU resources while the rest sit idle. This is really inefficient; how can I control this behavior?
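
For context, this is roughly how the default_args above get wired into the DAG. The DAG id, the BashOperator placeholder tasks, and the loop are illustrative stand-ins, not my actual script:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Stand-in wiring; the real DAG builds its 400+ tasks from my own pipeline logic.
dag = DAG('example_backfill_dag', default_args=default_args, schedule_interval='@once')

for i in range(400):
    BashOperator(
        task_id='task_{}'.format(i),
        bash_command='sleep 5',  # placeholder for the real work
        dag=dag,
    )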

Here are a few screenshots:

Things get off to a good enough start; there are just a lot more processes running in parallel than I anticipated: (screenshot)

Everything bogs down and there are a lot of idle processes; things seem to grind to a halt: (screenshot)

The terminal starts spitting out tons of error messages and there are a lot of process failures: (screenshot)

The process basically cycles through these phases until it finishes. The final task breakdown looks like this:

[2017-08-24 16:26:20,171] {jobs.py:2066} INFO - [backfill progress] | finished run 1 of 1 | tasks waiting: 0 | succeeded: 213 | kicked_off: 0 | failed: 200 | skipped: 0 | deadlocked: 0 | not ready: 0

Any thoughts?

aaron
  • We recommend against using dynamic values for start_date, especially datetime.now(), as it can be quite confusing. A task is triggered once its period closes, and in theory an @hourly DAG would never reach an hour after "now", since now() keeps moving along. https://airflow.incubator.apache.org/faq.html – Fokko Driesprong Aug 29 '17 at 13:38

1 Answer


`max_active_runs` defines how many runs Airflow will schedule per DAG. Separately, each executor has a specific capacity, depending on which one you are using. For the LocalExecutor, which is the most popular, this is set by the `parallelism` setting: the number of concurrent tasks the LocalExecutor should run. If you want to constrain the number of tasks running in parallel, you should make use of a pool.
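
For example, roughly (the pool name, the slot count, and the BashOperator task below are illustrative placeholders): the capacity knobs live in the [core] section of airflow.cfg, the pool itself is created up front in the web UI under Admin -> Pools, and a task opts into it via its `pool` argument.

# airflow.cfg, [core] section (illustrative values):
#   parallelism = 4        # max task instances running at once across the whole instance
#   dag_concurrency = 4    # max task instances running at once within a single DAG

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Static start_date (rather than datetime.now()), per the FAQ linked in the comments.
dag = DAG('pool_example', start_date=datetime(2017, 8, 1), schedule_interval='@once')

# 'backfill_pool' is a hypothetical pool created beforehand under Admin -> Pools,
# e.g. with 4 slots; only that many tasks assigned to it will run at the same time.
limited = BashOperator(
    task_id='limited_task',
    bash_command='sleep 5',
    pool='backfill_pool',
    dag=dag,
)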

Fokko Driesprong
  • I do have a follow-up question, however. `pool` seems to work by assigning tasks to different "blocks" of tasks, which are the pools. Is there a way to restrict parallelism at the level of the DAG? And where are pools instantiated? – aaron Aug 29 '17 at 16:30