16

What is the overhead of any asyncio task in terms of memory and speed? Is it ever worth minimising the number of tasks in cases when they don’t need to run concurrently?

Michal Charemza
  • 25,940
  • 14
  • 98
  • 165
  • 2
    That's a rather broad question… the question is, is it *efficient enough for you*? Executing the same tasks serially probably means the entire operation will take longer; whereas executing them asynchronously potentially finishes them all much quicker. Of course there's a resources vs. time tradeoff. You need to figure out which resource is more precious to you and which you can afford to spend, and how much. You do that best with benchmarking actual code. – deceze Apr 19 '19 at 12:16
  • 1
    In relation to what? threads? normal functions? processes? all? – Charming Robot Apr 20 '19 at 03:38
  • 1
    @deceze The phrase "is it efficient enough for you?" is one of the most frustrating to appear on SO. Developers frequently cannot answer that, but knowing the relative overhead of things is vital for avoiding "performance death by 1000 cuts". There's only three resources in play for this question: CPU time, memory, end-to-end time. And since the question is asking only the _overhead_ not a wider discussion of tradeoff, it seems easily answerable and not at all too broad. – Philip Couling Mar 13 '23 at 15:26

2 Answers

30

What is the overhead of any asyncio task in terms of memory and speed?

TL;DR The memory overhead appears negligible, but the time overhead can be significant, especially when the awaited coroutine chooses not to suspend.

Let's assume you are measuring the overhead of a task compared to a directly awaited coroutine, e.g.:

await some_coro()                       # (1)
await asyncio.create_task(some_coro())  # (2)

There is no reason to write (2) directly, but an unnecessary task can easily arise when using APIs that automatically "futurize" the awaitables they receive, such as asyncio.gather or asyncio.wait_for. (I suspect that building or using such an abstraction is the background of this question.)
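That wrapping is easy to observe. The sketch below (the `probe` helper is made up for illustration) uses asyncio.current_task() to show that a direct await runs the coroutine inside the current task, while asyncio.gather wraps it in a fresh Task:

```python
import asyncio

async def probe():
    # Report which Task this coroutine is executing in.
    return asyncio.current_task()

async def main():
    me = asyncio.current_task()
    direct = await probe()                      # runs inside the current task
    (wrapped,) = await asyncio.gather(probe())  # gather wraps the coroutine in a new Task
    return direct is me, wrapped is me

print(asyncio.run(main()))  # (True, False)
```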

It is straightforward to measure the memory and time difference between the two variants. For example, the following program creates a million tasks, and the memory consumption of the process can be divided by a million to get an estimate of the memory cost of a task:

import asyncio
import time

async def noop():
    pass

async def mem1():
    tasks = [asyncio.create_task(noop()) for _ in range(1000000)]
    time.sleep(60)  # not asyncio.sleep() in this case - we don't
                    # want our noop tasks to exit immediately

asyncio.run(mem1())
On my 64-bit Linux machine running Python 3.7, the process consumes approximately 1 GiB of memory. That's about 1 KiB per task+coroutine, and it counts both the memory for the task and the memory for its entry in the event loop bookkeeping. The following program measures an approximation of the overhead of just a coroutine:

async def mem2():
    coros = [noop() for _ in range(1000000)]
    time.sleep(60)  # again a blocking sleep, to keep the coroutine objects alive

asyncio.run(mem2())

The above process takes about 550 MiB of memory, or 0.55 KiB per coroutine. So it seems that while a task isn't exactly free, it doesn't impose a huge memory overhead over a coroutine, especially keeping in mind that the above coroutine was empty. If the coroutine had some state, the overhead would have been much smaller in relative terms.
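If you'd rather not eyeball process memory from the outside, tracemalloc can give a rough in-process estimate of both figures at once. This is a sketch with a smaller count; the absolute numbers will differ from the RSS-based figures above, but the task-vs-coroutine relationship should hold:

```python
import asyncio
import tracemalloc

async def noop():
    pass

async def measure(n=100_000):
    tracemalloc.start()
    base = tracemalloc.get_traced_memory()[0]
    coros = [noop() for _ in range(n)]
    after_coros = tracemalloc.get_traced_memory()[0]
    tasks = [asyncio.create_task(noop()) for _ in range(n)]
    after_tasks = tracemalloc.get_traced_memory()[0]
    per_coro = (after_coros - base) / n
    per_task = (after_tasks - after_coros) / n
    for c in coros:
        c.close()           # avoid "coroutine was never awaited" warnings
    await asyncio.gather(*tasks)
    tracemalloc.stop()
    return per_coro, per_task

per_coro, per_task = asyncio.run(measure())
print(f"~{per_coro:.0f} B per coroutine, ~{per_task:.0f} B per task+coroutine")
```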

But what about the CPU overhead - how long does it take to create and await a task compared to just awaiting a coroutine? Let's try a simple measurement:

async def cpu1():
    t0 = time.time()
    for _ in range(1000000):
        await asyncio.create_task(noop())
    t1 = time.time()
    print(t1 - t0)

asyncio.run(cpu1())

On my machine this takes 27 seconds (on average, with very small variations) to run. The version without a task would look like this:

async def cpu2():
    t0 = time.time()
    for _ in range(1000000):
        await noop()
    t1 = time.time()
    print(t1 - t0)

asyncio.run(cpu2())

This one takes only 0.16 seconds, a factor of ~170! So it turns out that the time overhead of awaiting a task is non-negligible compared to awaiting a coroutine object. This is for two reasons:

  • Tasks are more expensive to create than coroutine objects, because they require initializing the base Future, then the properties of the Task itself, and finally inserting the task into the event loop, with its own bookkeeping.

  • A freshly created task is in a pending state, its constructor having scheduled it to start executing the coroutine at the first opportunity. Since the task owns the coroutine object, awaiting a fresh task cannot just start executing the coroutine; it has to suspend and wait for the task to get around to executing it. The awaiting coroutine will only resume after a full event loop iteration, even when awaiting a coroutine that chooses not to suspend at all! An event loop iteration is expensive because it goes through all runnable tasks and polls the kernel for IO and timeout activities. Indeed, strace of cpu1 shows two million calls to epoll_wait(2). cpu2 on the other hand only goes to the kernel for the occasional allocation-related mmap(), a couple thousand in total.

    In contrast, directly awaiting a coroutine doesn't yield to the event loop unless the awaited coroutine itself decides to suspend. Instead, it immediately goes ahead and starts executing the coroutine as if it were an ordinary function.
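The difference is easy to observe with a loop callback. In this sketch, a callback scheduled with call_soon stays pending across a direct await of a non-suspending coroutine, and only runs once awaiting a task forces a full event loop iteration:

```python
import asyncio

async def noop():
    pass

async def main():
    ran = []
    asyncio.get_running_loop().call_soon(ran.append, "callback")
    await noop()                        # never suspends: the callback stays pending
    after_coro = list(ran)
    await asyncio.create_task(noop())   # suspends for a loop iteration: callback runs
    return after_coro, ran

print(asyncio.run(main()))  # ([], ['callback'])
```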

So, if your coroutine's happy path does not involve suspending (as is the case with non-contended synchronization primitives, or with stream reading from a non-blocking socket that has data to provide), the cost of awaiting it is comparable to the cost of a function call. That is much faster than the event loop iteration required to await a task, and can make a difference when latency matters.
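As a sketch of that happy path, an uncontended asyncio.Lock can be acquired and released in a tight loop without ever yielding to the event loop; the whole loop finishes in roughly function-call time (absolute timings are machine-dependent):

```python
import asyncio
import time

async def main(n=100_000):
    lock = asyncio.Lock()
    t0 = time.perf_counter()
    for _ in range(n):
        async with lock:    # uncontended: acquire() returns without suspending
            pass
    return time.perf_counter() - t0

elapsed = asyncio.run(main())
print(f"{elapsed:.3f}s for 100k uncontended acquire/release pairs")
```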

user4815162342
  • 141,790
  • 18
  • 296
  • 355
  • Thanks for all the detail... A question though, does ` coros = [noop() for _ in range(1000000)]` actually schedule all the `noop`s to run? – Michal Charemza Apr 19 '19 at 19:27
  • 1
    @MichalCharemza It doesn't, the automatic scheduling is a property of the higher-level `Task`, not of the lower-level coroutine object. In the memory benchmark the creation of a million of them only serves to make the memory usage apparent, without pretense that the run-time semantics of actually awaiting them would be the same. – user4815162342 Apr 19 '19 at 19:40
  • 2
    Suspending seems to be the most significant part here: if I alter the code to `async def noop(): await asyncio.sleep(0)` I get `10 sec.` vs `30 sec.`. I'm not sure I'm buying the argument about `coroutine is simple enough`: there's no need to create a coroutine if it's not going to suspend, especially millions of them. Still, thanks for the research! – Mikhail Gerasimov Apr 20 '19 at 02:24
  • 1
    @MikhailGerasimov *there's no need to create coroutine if it's not going to suspend* I'm not considering a coroutine that's **never** going to suspend, but one that might not suspend **typically**. The answer mentions `stream.read()` as an example which works exactly like that, but there are other examples, such as `queue.get` and `queue.put`, the `__aenter__` methods on many async context managers, the synchronization methods in the non-contended case, and so on. There are many low-level coroutines that don't suspend every time when awaited. – user4815162342 Apr 20 '19 at 06:42
  • @MikhailGerasimov right to be skeptical. It should also be observed that this method of measuring "overhead" is misleading, perhaps plain wrong. It measures end-to-end time, not resource usage time. The method that creates a task spreads the work across 2,000,000 iterations of the event loop, whereas the direct `await` gets the whole thing done in just 1 iteration of the event loop. So even @MikhailGerasimov's `10 sec` vs `30 sec` comparison is skewed, because it spreads the work across 1,000,000 iterations of the event loop vs 3,000,000 iterations of the event loop. – Philip Couling Mar 13 '23 at 15:11
1

A Task itself is just a tiny Python object, requiring a negligible amount of memory and CPU. The operation being run by the task (a task usually runs a coroutine), on the other hand, may consume noticeable resources of its own, for example:

  • network bandwidth, if we are talking about network operations (network read/write)
  • CPU/memory, if we are talking about an operation run in a separate process using run_in_executor

Usually(*) you don't have to think about the number of tasks, in the same way that you don't usually think about the number of function calls in your Python script.

But of course you should always think about how your async program works in general. If it's going to make many simultaneous I/O requests or spawn many simultaneous threads/processes, you should use a Semaphore to avoid acquiring too many resources at once.
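A minimal sketch of that pattern (the `fetch` helper is hypothetical, with a zero-length sleep standing in for a real network call):

```python
import asyncio

async def fetch(i, sem):
    async with sem:              # at most 10 "requests" in flight at a time
        await asyncio.sleep(0)   # stand-in for a real network call
        return i

async def main():
    sem = asyncio.Semaphore(10)
    return await asyncio.gather(*(fetch(i, sem) for i in range(100)))

results = asyncio.run(main())
print(results[:5])  # [0, 1, 2, 3, 4]
```

All 100 tasks still exist at once here; the semaphore only bounds how many are past the `async with` at any moment.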


(*) unless you're doing something very special and plan to create billions of tasks. In that case you should create them lazily, using a Queue or something similar.
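A sketch of that lazy approach: a bounded asyncio.Queue feeds a small fixed pool of worker tasks, so there is never one task per work item (names and the `item * 2` "work" are illustrative):

```python
import asyncio

async def worker(queue, results):
    while True:
        item = await queue.get()
        if item is None:          # sentinel: shut this worker down
            return
        results.append(item * 2)  # stand-in for real work

async def main(n_items=1000, n_workers=4):
    queue = asyncio.Queue(maxsize=100)  # bounded: the producer pauses when full
    results = []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(n_workers)]
    for i in range(n_items):            # items are queued lazily, never all at once
        await queue.put(i)
    for _ in workers:                   # one sentinel per worker
        await queue.put(None)
    await asyncio.gather(*workers)
    return results

out = asyncio.run(main())
print(len(out))  # 1000
```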

Mikhail Gerasimov
  • 36,989
  • 16
  • 116
  • 159