tl;dr: how do I maximize the number of HTTP requests I can send in parallel?

I am fetching data from multiple URLs with the aiohttp library. While testing its performance I've observed a bottleneck somewhere in the process: beyond a certain point, running more URLs at once just doesn't help.

I am using this code:

import asyncio
import aiohttp

async def fetch(url, session):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'}
    try:
        async with session.get(
            url, headers=headers,
            ssl=False,
            timeout=aiohttp.ClientTimeout(
                total=None,
                sock_connect=10,
                sock_read=10
            )
        ) as response:
            content = await response.read()
            return (url, 'OK', content)
    except Exception as e:
        print(e)
        return (url, 'ERROR', str(e))

async def run(url_list):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in url_list:
            task = asyncio.ensure_future(fetch(url, session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses
    return responses

loop = asyncio.get_event_loop()
asyncio.set_event_loop(loop)
task = asyncio.ensure_future(run(url_list))
loop.run_until_complete(task)
result = task.result().result()

Running this with url_list of varying length (testing against https://httpbin.org/delay/2), I see that adding more URLs to a batch helps only up to about 100 URLs; beyond that, total time grows proportionally to the number of URLs (in other words, time per URL stops decreasing). This suggests that something prevents all of them from being processed at once. In addition, with more URLs in one batch I occasionally get connection timeout errors.

[image: graph of the timings described above]

  • Why is this happening? What exactly limits the speed here?
  • How can I check the maximum number of parallel requests I can send on a given computer? (I mean an exact number, not an approximation found by trial and error as above.)
  • What can I do to increase the number of requests processed at once?

I am running this on Windows.

EDIT in response to comment:

This is the same data with limit set to None. There is only a slight improvement at the end, and there are many connection timeout errors with 400 URLs sent at once. I ended up using limit = 200 on my actual data.

[image: updated graph with limit set to None]

pieca
  • It'd be really interesting to see the updated graph with the artificial limit removed. Could you perhaps edit the question to include it? – user4815162342 Mar 20 '19 at 19:46
  • @user4815162342 updated – pieca Mar 21 '19 at 10:03
  • @pieca I'm not sure when aiohttp starts its timeout timer, so instead of limiting connections you may want to leave the limit at `None` and use a semaphore to limit the number of simultaneous requests. [Here's an example](https://stackoverflow.com/a/55270554/1113207) of how it can be done. It may improve performance and reduce errors. – Mikhail Gerasimov Mar 21 '19 at 10:55
  • @MikhailGerasimov thanks for the link, I'll try to run it that way – pieca Mar 21 '19 at 11:04
  • Thanks! This is good to know. – user4815162342 Mar 21 '19 at 14:28

1 Answer

By default, aiohttp limits the number of simultaneous connections to 100. It does this by setting a default limit on the TCPConnector object used by ClientSession. You can lift the limit by creating a custom connector and passing it to the session:

connector = aiohttp.TCPConnector(limit=None)
async with aiohttp.ClientSession(connector=connector) as session:
    # ...

Note, however, that you probably don't want to set this number too high: your network capacity, CPU, RAM and the target server all have their own limits, and trying to open an enormous number of connections can lead to more failures.
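One way to cap the number of requests in flight independently of the connector is a semaphore, as suggested in the comments. The following is only a minimal sketch of that idea: `bounded_gather`, `demo` and `fake_fetch` are hypothetical names, and `fake_fetch` uses `asyncio.sleep` as a stand-in for a real `session.get(...)` call so the example runs without network access.

```python
import asyncio

async def bounded_gather(coro_funcs, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro_func):
        # Each call waits here until one of the `limit` slots is free.
        async with semaphore:
            return await coro_func()

    return await asyncio.gather(*(run_one(f) for f in coro_funcs))

async def demo(n_requests=20, limit=5):
    """Return the results and the highest concurrency actually observed."""
    in_flight = 0
    peak = 0

    async def fake_fetch():
        # Hypothetical stand-in for a real HTTP request.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return 'OK'

    results = await bounded_gather([fake_fetch] * n_requests, limit)
    return results, peak
```

With the question's code, each entry passed to `bounded_gather` could be something like `functools.partial(fetch, url, session)`, so that the request is only created once a slot is available.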

The optimal number can probably be found only by experimenting on the specific machine.
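Such an experiment could look roughly like the sketch below: run the same batch under several concurrency limits and compare the wall-clock times. The names `simulated_fetch`, `timed_batch` and `sweep` are made up for illustration, and `simulated_fetch` only sleeps; for a real measurement you would pass in a coroutine function that performs the actual request.

```python
import asyncio
import time

async def simulated_fetch(delay=0.02):
    # Hypothetical stand-in for a real request; replace with real I/O.
    await asyncio.sleep(delay)

async def timed_batch(n_requests, limit, fetch=simulated_fetch):
    """Return the seconds needed to finish n_requests, at most `limit` at once."""
    semaphore = asyncio.Semaphore(limit)

    async def one():
        async with semaphore:
            await fetch()

    start = time.perf_counter()
    await asyncio.gather(*(one() for _ in range(n_requests)))
    return time.perf_counter() - start

async def sweep(n_requests, limits):
    """Measure each candidate limit in turn and return {limit: seconds}."""
    timings = {}
    for limit in limits:
        timings[limit] = await timed_batch(n_requests, limit)
    return timings
```

For example, `asyncio.run(sweep(400, [50, 100, 200, 400]))` would return one timing per limit. Against a real server the timings are noisy, so repeating each measurement a few times is advisable.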


Unrelated:

You don't need to create tasks without a reason. Most asyncio APIs accept regular coroutines. For example, the last lines of your code can be changed to:

loop = asyncio.get_event_loop()
loop.run_until_complete(run(url_list))

Or even to just asyncio.run(run(url_list)) if you're using Python 3.7 or later.
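Put together, a simplified version of the pattern might look like this. It is a sketch only: `fetch_stub` is a hypothetical placeholder for the question's `fetch(url, session)` so that the snippet runs without aiohttp or network access.

```python
import asyncio

async def fetch_stub(url):
    # Hypothetical placeholder for the real fetch(url, session).
    await asyncio.sleep(0)
    return (url, 'OK', b'')

async def run(url_list):
    # gather accepts coroutines directly and returns results in input order,
    # so there is no need for ensure_future or a second .result() call.
    return await asyncio.gather(*(fetch_stub(url) for url in url_list))

result = asyncio.run(run(['a', 'b']))
```

Here `result` is a plain list of tuples, one per URL.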

Mikhail Gerasimov