aiohttp.TCPConnector (with limit argument) vs asyncio.Semaphore for limiting the number of concurrent connections

Question

I wanted to learn the new Python async/await syntax, and more specifically the asyncio module, by writing a simple script that downloads multiple resources at once.

But now I'm stuck.

While researching I came across two options to limit the number of concurrent requests:

Passing an aiohttp.TCPConnector (with limit argument) to an aiohttp.ClientSession, or
Using an asyncio.Semaphore.

Is there a preferred option, or can they be used interchangeably if all you want is to limit the number of concurrent connections? Are they (roughly) equal in terms of performance?

Also, both seem to have a default value of 100 concurrent connections/operations. If I use only a Semaphore with a limit of, let's say, 500, will the aiohttp internals lock me down to 100 concurrent connections implicitly?

This is all very new and unclear to me. Please feel free to point out any misunderstandings on my part or flaws in my code.

Bonus Questions:

How do I handle (preferably retry x times) coros that threw an error?
What is the best way to save the returned data (inform my DataHandler) as soon as a coro is finished? I don't want it all to be saved at the end, because I would like to start working with the results as soon as possible.

Here is my code, currently containing both options (which should I remove?):

import asyncio
from tqdm import tqdm
import uvloop
from aiohttp import ClientSession, TCPConnector, BasicAuth

# You can ignore this class
class DummyDataHandler(DataHandler):
    """Takes data and stores it somewhere"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def take(self, origin_url, data):
        return True

    def done(self):
        return None

class AsyncDownloader(object):
    def __init__(self, concurrent_connections=100, silent=False, data_handler=None, loop_policy=None):

        self.concurrent_connections = concurrent_connections
        self.silent = silent

        self.data_handler = data_handler or DummyDataHandler()

        self.sending_bar = None
        self.receiving_bar = None

        asyncio.set_event_loop_policy(loop_policy or uvloop.EventLoopPolicy())
        self.loop = asyncio.get_event_loop()
        self.semaphore = asyncio.Semaphore(concurrent_connections)

    async def fetch(self, session, url):
        # This is option 1: The semaphore, limiting the number of concurrent coros,
        # thereby limiting the number of concurrent requests.
        async with self.semaphore:
            async with session.get(url) as response:
                # Bonus Question 1: What is the best way to retry a request that failed?
                resp_task = asyncio.ensure_future(response.read())
                self.sending_bar.update(1)
                resp = await resp_task

                await response.release()
                if not self.silent:
                    self.receiving_bar.update(1)
                return resp

    async def batch_download(self, urls, auth=None):
        # This is option 2: Limiting the number of open connections directly via the TCPConnector
        conn = TCPConnector(limit=self.concurrent_connections, keepalive_timeout=60)
        async with ClientSession(connector=conn, auth=auth) as session:
            await asyncio.gather(*[asyncio.ensure_future(self.download_and_save(session, url)) for url in urls])

    async def download_and_save(self, session, url):
        content_task = asyncio.ensure_future(self.fetch(session, url))
        content = await content_task
        # Bonus Question 2: This is blocking, I know. Should this be wrapped in another coro
        # or should I use something like asyncio.as_completed in the download function?
        self.data_handler.take(origin_url=url, data=content)

    def download(self, urls, auth=None):
        if isinstance(auth, tuple):
            auth = BasicAuth(*auth)
        print('Running on concurrency level {}'.format(self.concurrent_connections))
        self.sending_bar = tqdm(urls, total=len(urls), desc='Sent    ', unit='requests')
        self.sending_bar.update(0)

        self.receiving_bar = tqdm(urls, total=len(urls), desc='Received', unit='requests')
        self.receiving_bar.update(0)

        tasks = self.batch_download(urls, auth)
        self.loop.run_until_complete(tasks)
        return self.data_handler.done()


### call like so ###

URL_PATTERN = 'https://www.example.com/{}.html'

def gen_url(lower=0, upper=None):
    for i in range(lower, upper):
        yield URL_PATTERN.format(i)   

ad = AsyncDownloader(concurrent_connections=30)
data = ad.download([g for g in gen_url(upper=1000)])

I have the same question, looks like they might be able to be used interchangeably https://stackoverflow.com/questions/35196974/aiohttp-set-maximum-number-of-requests-per-second — Glen Thompson, Aug 18 '17 at 23:19
`asyncio.Semaphore` has a default value of only 1 for its internal counter; see [`asyncio` Synchronisation Primitives](https://docs.python.org/3/library/asyncio-sync.html#asyncio.Semaphore). It can be raised as required, but your operating system still limits the number of concurrently open files (TCP connections are files on *nix-like systems, including macOS) — Darkfish, Oct 06 '17 at 05:06
For bonus question 2, look at producer-consumer design pattern in software architecture. — Darkfish, Oct 06 '17 at 05:15
Generally I prefer to see a minimal amount of code to describe the problem, but I just discovered tqdm here. No more hand rolled ascii spinners for me, thanks! — Joseph Sheedy, Jul 26 '18 at 16:57
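
For bonus question 2, the producer-consumer pattern mentioned in the comments could look roughly like the sketch below, which uses asyncio.Queue from the standard library. The names `fake_fetch`, `download_all` and the `handle` callback are hypothetical, and the sleep stands in for a real HTTP request, so treat this as an illustration under those assumptions rather than a drop-in replacement for the code in the question.

```python
import asyncio
import contextlib


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request.
    await asyncio.sleep(0.01)
    return "body of " + url


async def producer(url, queue):
    # Download one URL and hand the result over as soon as it is ready.
    data = await fake_fetch(url)
    await queue.put((url, data))


async def consumer(queue, handle):
    # Process results one by one, in completion order.
    while True:
        url, data = await queue.get()
        handle(url, data)
        queue.task_done()


async def download_all(urls, handle):
    queue = asyncio.Queue()
    consumer_task = asyncio.ensure_future(consumer(queue, handle))
    await asyncio.gather(*(producer(url, queue) for url in urls))
    await queue.join()  # wait until every queued result has been handled
    consumer_task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await consumer_task


if __name__ == "__main__":
    stored = {}
    asyncio.run(download_all(["a", "b", "c"], stored.__setitem__))
    print(sorted(stored))
```

In the question's code, `handle` would correspond to `data_handler.take`. If that call does blocking I/O, it still stalls the event loop while it runs; handing it to `loop.run_in_executor` is one possible way around that.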

score 3 · Accepted Answer · answered Dec 26 '18 at 13:07

Is there a preferred option?

Yes, see below:

will the aiohttp internals lock me down to 100 concurrent connections implicitly?

Yes, the default value of 100 will lock you down, unless you specify another limit. You can see it in the source here: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/connector.py#L1084
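
To illustrate how two stacked limits interact, here is a small simulation that uses only the standard library. It does not touch aiohttp: the inner semaphore is merely a stand-in for the connector's limit, and `measure_peak` is a hypothetical helper. In this model the effective concurrency is the smaller of the two limits.

```python
import asyncio


async def measure_peak(outer_limit, inner_limit, n_tasks):
    """Return the peak number of tasks that hold both limits at once."""
    outer = asyncio.Semaphore(outer_limit)  # plays the role of the user's Semaphore
    inner = asyncio.Semaphore(inner_limit)  # stand-in for the connector's limit
    active = 0
    peak = 0

    async def worker():
        nonlocal active, peak
        async with outer:
            async with inner:
                active += 1
                peak = max(peak, active)
                await asyncio.sleep(0.01)  # simulated network I/O
                active -= 1

    await asyncio.gather(*(worker() for _ in range(n_tasks)))
    return peak


if __name__ == "__main__":
    # The outer limit is larger than the inner one, so the inner one decides.
    print(asyncio.run(measure_peak(outer_limit=50, inner_limit=10, n_tasks=200)))
```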

Are they (roughly) equal in terms of performance?

No, but the difference should be negligible. aiohttp.TCPConnector checks for available connections anyway, whether or not the request is wrapped in a Semaphore, so using a Semaphore here would just be unnecessary overhead.

How do I handle (preferably retry x times) coros that threw an error?

I don't believe there is a standard way to do so, but one solution would be to wrap your calls in a method like this:

async def retry_requests(...):
    for i in range(5):
        try:
            return await session.get(...)
        except aiohttp.ClientResponseError:
            pass
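
A self-contained variant of the same idea is sketched below. It uses only the standard library, and `retry_async`, `make_call` and `flaky` are hypothetical names. Unlike the snippet above, which returns None once every attempt has failed, this version re-raises the last error so the caller can see it.

```python
import asyncio


async def retry_async(make_call, retries=5, exceptions=(Exception,)):
    """Await make_call() up to `retries` times and re-raise the last failure."""
    for attempt in range(1, retries + 1):
        try:
            return await make_call()
        except exceptions:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error


if __name__ == "__main__":
    calls = {"n": 0}

    async def flaky():
        # Hypothetical operation that fails twice before succeeding.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(asyncio.run(retry_async(flaky, exceptions=(ConnectionError,))))
```

With aiohttp, `make_call` could be something like `lambda: self.fetch(session, url)`, and `exceptions` could be `aiohttp.ClientError`, the base class of aiohttp's client exceptions.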

score 0 · Answer 2 · answered Jul 22 '19 at 18:36

How do I handle (preferably retry x times) coros that threw an error?

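I created a Python decorator to handle that:

    import logging
    import time
    from functools import wraps

    # Note: the leading `cls` parameter suggests this was written as a classmethod;
    # drop it to use the decorator as a plain function.
    def retry(cls, exceptions, tries=3, delay=2, backoff=2):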
        """
        Retry calling the decorated function using an exponential backoff. This
        is required in case of requesting Braze API produces any exceptions.

        Args:
            exceptions: The exception to check. may be a tuple of
                exceptions to check.
            tries: Number of times to try (not retry) before giving up.
            delay: Initial delay between retries in seconds.
            backoff: Backoff multiplier (e.g. value of 2 will double the delay
                each retry).
        """

        def deco_retry(func):
            @wraps(func)
            def f_retry(*args, **kwargs):
                mtries, mdelay = tries, delay
                while mtries > 1:
                    try:
                        return func(*args, **kwargs)
                    except exceptions as e:
                        msg = '{}, Retrying in {} seconds...'.format(e, mdelay)
                        if logging:
                            logging.warning(msg)
                        else:
                            print(msg)
                        time.sleep(mdelay)
                        mtries -= 1
                        mdelay *= backoff
                return func(*args, **kwargs)

            return f_retry

        return deco_retry
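
As written, the decorator above calls `func` without awaiting it and pauses with `time.sleep`. Applied to a coroutine function, it would therefore not catch errors raised while the coroutine is awaited, and the sleep would block the event loop. A possible asyncio variant is sketched below; `async_retry` is a hypothetical name, and the behaviour otherwise follows the decorator above (`tries` counts attempts, and the delay grows by `backoff` after each failure).

```python
import asyncio
import logging
from functools import wraps


def async_retry(exceptions, tries=3, delay=2, backoff=2):
    """Retry a coroutine function with exponential backoff (asyncio variant)."""

    def deco_retry(func):
        @wraps(func)
        async def f_retry(*args, **kwargs):
            mdelay = delay
            for _ in range(tries - 1):
                try:
                    return await func(*args, **kwargs)
                except exceptions as e:
                    logging.warning('%s, retrying in %s seconds...', e, mdelay)
                    await asyncio.sleep(mdelay)  # yields to the event loop
                    mdelay *= backoff
            # Final attempt: any exception propagates to the caller.
            return await func(*args, **kwargs)

        return f_retry

    return deco_retry
```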
