
I was migrating a production system to async when I realized that the synchronous version is roughly 100x faster than the async version. I was able to create a very simple example that demonstrates this in a repeatable way:

Asynchronous Version

import asyncio, time

data = {}

async def process_usage(key):
    data[key] = key

async def main():
    await asyncio.gather(*(process_usage(key) for key in range(0,1000000)))

s = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")

This takes 19 seconds. The code loops through 1M keys and builds a dictionary, `data`, mapping each key to itself.

$ python3.7 async_test.py
Took 19.08 seconds.

Synchronous Version

import time

data = {}

def process_usage(key):
    data[key] = key

def main():
    for key in range(0,1000000):
        process_usage(key)

s = time.perf_counter()
results = main()
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")

This takes 0.17 seconds! And it does exactly the same thing as the version above.

$ python3.7 test.py
Took 0.17 seconds.

Asynchronous Version with create_task

import asyncio, time

data = {}

async def process_usage(key):
    data[key] = key

async def main():
    for key in range(0,1000000):
        asyncio.create_task(process_usage(key))

s = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")

This version brings it down to 11 seconds (though note that main never awaits the tasks it creates, so asyncio.run may cancel them before they all complete).

$ python3.7 async_test2.py
Took 11.91 seconds.

Why does this happen?

In my production code I will have a blocking call in process_usage where I save the value of key to a redis database.
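Something like this, with the redis write simulated by asyncio.sleep so the snippet stays self-contained:

```python
import asyncio

data = {}

async def process_usage(key):
    # stand-in for the awaited redis write in the real code
    await asyncio.sleep(0)
    data[key] = key

async def main():
    await asyncio.gather(*(process_usage(key) for key in range(1_000)))

asyncio.run(main())
```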

  • Well for one, your asynchronous code has to generate a function call with 1 million arguments, which will require loading all of that into memory, whereas your synchronous code just uses the efficient `range()` iterator. – Kyle Willmon May 07 '19 at 16:01
  • @KyleWillmon I'm new to async is there a better way to do this? In production I also have to loop through 1M keys but from a database not the range function. – Jonathan May 07 '19 at 16:04
  • As far as I know, you're always going to need quite a bit of overhead to keep track of 1 million coroutines. However, 19 seconds does seem excessive for this trivial example. Perhaps someone else can explain more about that. – Kyle Willmon May 07 '19 at 16:14
  • I tried generating the function calls outside of main & it seems that the call generation part is taking around 10 secs out of those 20. The rest is async overhead due to all the coroutines. You might have better luck with your actual code since 1) You would already have all the arguments to the function in memory & 2) The `process` function would not be a trivial CPU-bound method (hopefully). In case your process is actually CPU bound, you'll be better off using a process pool. – rdas May 07 '19 at 16:19
  • I've added an example with `create_task` that brings it down to 11 seconds. That would be the best option at the moment. However my script makes heavy use of redis and I'd like to use aioredis from within process_usage but I can't do that if it's not async. – Jonathan May 07 '19 at 16:29
  • Why would you expect asyncio to be faster here? You're doing completely CPU-bound work. – juanpa.arrivillaga May 07 '19 at 17:23
  • @juanpa.arrivillaga In my production code I'm doing a database write in process_usage and I see the same behavior. What you see in the post is an example. – Jonathan May 07 '19 at 17:27
  • But then your benchmark has no bearing on what you care about. – juanpa.arrivillaga May 07 '19 at 17:28
  • Feel free to edit the post with a better example of this behavior. – Jonathan May 07 '19 at 17:31

2 Answers


When comparing those benchmarks, one should note that the asynchronous version is, well, asynchronous: asyncio spends considerable effort to ensure that the coroutines you submit can run concurrently. In your particular case they don't actually run concurrently because process_usage doesn't await anything, but the system doesn't actually know that. The synchronous version, on the other hand, makes no such provisions: it just runs everything sequentially, hitting the happy path of the interpreter.

A more reasonable comparison would be for the synchronous version to try to parallelize things in the way idiomatic for synchronous code: by using threads. Of course, you won't be able to create a separate thread for each process_usage because, unlike asyncio with its tasks, the OS won't allow you to create a million threads. But you can create a thread pool and feed it tasks:

import concurrent.futures

def main():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for key in range(0, 1000000):
            executor.submit(process_usage, key)
        # at the end of the "with" block the executor automatically
        # waits for all submitted futures to finish

On my system this takes ~17s, whereas the asyncio version takes ~18s. (The faster asyncio version, the one using create_task, takes ~13s.)

If the speed gain of asyncio is so small, one could ask why bother with asyncio? The difference is that with asyncio, assuming idiomatic code and IO-bound coroutines, you have at your disposal a virtually unlimited number of tasks that in a very real sense execute concurrently. You can create tens of thousands of asynchronous connections at the same time, and asyncio will happily juggle them all at once, using a high-quality poller and a scalable coroutine scheduler. With a thread pool the number of tasks executed in parallel is always limited by the number of threads in the pool, typically in the hundreds at most.

Even toy examples have value, for learning if nothing else. If you are using such microbenchmarks to make decisions, I suggest investing some more effort to give the examples more realism. The coroutine in the asyncio example should contain at least one await, and the sync example should use threads to emulate the same amount of parallelism you obtain with async. If you adjust both to match your actual use case, then the benchmark actually puts you in a position to make a (more) informed decision.
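For illustration (not the poster's actual code), here is a sketch of such an adjusted benchmark: the coroutine awaits a simulated I/O operation (asyncio.sleep standing in for a database write), and a semaphore caps how many coroutines are in flight at once; the names n and limit are illustrative:

```python
import asyncio, time

data = {}

async def process_usage(key, sem):
    async with sem:
        await asyncio.sleep(0.001)  # simulated I/O, e.g. a database write
        data[key] = key

async def main(n=10_000, limit=1_000):
    # the semaphore bounds the number of simultaneous "in flight" operations
    sem = asyncio.Semaphore(limit)
    await asyncio.gather(*(process_usage(key, sem) for key in range(n)))

s = time.perf_counter()
asyncio.run(main())
print(f"Took {time.perf_counter() - s:0.2f} seconds.")
```

With an actual await inside the coroutine, the asyncio overhead buys you real concurrency, and the semaphore keeps the load on the backing service bounded.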

user4815162342
  • Thanks, this helped me better understand why this happens. My actual production function does an async write to redis with aioredis but I now understand the source of the overhead. – Jonathan May 07 '19 at 17:55
  • @Jonathan It would be interesting to examine your original problem in more detail. It's far from clear why parallel asyncio connections to redis would fare slower than the same number of sequential connections, except redis itself getting overwhelmed and underperforming. Perhaps the performance of your code would be best improved through judicious use of semaphores or a queue feeding a fixed number of workers. Creating a huge number of concurrent tasks is *possible* in asyncio, but it doesn't mean that it's the optimal approach for every problem. – user4815162342 May 07 '19 at 18:05
  • It basically read usage data for each key from a dict and, if usage was above e.g. 1000, wrote the key to a `rate_limit` redis db. It's really that simple. The synchronous version takes 1s; I'm trying out a mix of synchronous for that and async for the rest of the script (to do a number of batch writes to dynamodb). Hope that helps. – Jonathan May 07 '19 at 18:10
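For what it's worth, the "queue feeding a fixed number of workers" pattern mentioned in the comments above can be sketched like this (the redis write is simulated with asyncio.sleep; names such as worker, n_keys, and n_workers are illustrative, not from the poster's code):

```python
import asyncio

data = {}

async def worker(queue):
    while True:
        key = await queue.get()
        await asyncio.sleep(0)  # stand-in for the real redis write
        data[key] = key
        queue.task_done()

async def main(n_keys=1_000, n_workers=50):
    queue = asyncio.Queue()
    # a fixed pool of workers drains the queue concurrently
    workers = [asyncio.create_task(worker(queue)) for _ in range(n_workers)]
    for key in range(n_keys):
        queue.put_nowait(key)
    await queue.join()   # blocks until every queued key has been processed
    for w in workers:
        w.cancel()       # the workers loop forever, so cancel them explicitly

asyncio.run(main())
```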

Why does this happen?

TL;DR

Because using asyncio by itself doesn't speed up code. You need multiple concurrently running network-I/O-related operations to see a difference from the synchronous version.

Detailed

asyncio is not magic that allows you to speed up arbitrary code. With or without asyncio, your code is still run by a CPU with limited performance.

asyncio is a way to manage multiple execution flows (coroutines) in a nice, clear way. Multiple execution flows allow you to start the next I/O-related operation (such as a request to a database) without waiting for the previous one to complete. Please read this answer for a more detailed explanation.

Please also read this answer for explanation when it makes sense to use asyncio.

Once you start to use asyncio the right way, the overhead of using it should be much lower than the benefit you get from parallelizing I/O operations.
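To make that concrete, here is a sketch where the work is simulated I/O (asyncio.sleep standing in for a network round-trip); the gathered version finishes in roughly the time of a single operation, while the sequential one pays for every operation in turn:

```python
import asyncio, time

async def fake_io(key):
    # stand-in for a network round-trip (database write, HTTP call, ...)
    await asyncio.sleep(0.01)

async def sequential(n):
    for key in range(n):
        await fake_io(key)

async def gathered(n):
    await asyncio.gather(*(fake_io(key) for key in range(n)))

s = time.perf_counter()
asyncio.run(sequential(100))
print(f"sequential: {time.perf_counter() - s:0.2f}s")  # ~1 s

s = time.perf_counter()
asyncio.run(gathered(100))
print(f"gathered:   {time.perf_counter() - s:0.2f}s")  # well under a second
```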

Mikhail Gerasimov