
Note to future readers: this question is old and was formatted and programmed in a rush. The answer given may still be useful, but the question and code probably are not.

Hello everyone,

I'm having trouble understanding asyncio and aiohttp and making them work together. Because I don't fully understand what I'm doing, I've run into a problem that I have no idea how to solve.

I'm using 64-bit Windows 10.

The following code returns a list of pages that do not contain "html" in the Content-Type header. It's implemented using asyncio.

import asyncio
import aiohttp

MAXitems = 30


async def getHeaders(url, session, sema):
    # Note: don't wrap this in `async with session:` -- that would close
    # the shared session as soon as the first coroutine finishes.
    async with sema:
        try:
            async with session.head(url) as response:
                return url, "html" in response.headers.get("Content-Type", "")
        except Exception:  # connection errors, timeouts, malformed headers
            return url, False


def check_urls_without_html(list_of_urls):
    headers_without_html = set()
    loop = asyncio.get_event_loop()
    semaphoreHeaders = asyncio.Semaphore(50)
    session = aiohttp.ClientSession()
    while list_of_urls:
        # Take the next chunk of at most MAXitems URLs.
        blockurls = list_of_urls[:MAXitems]
        del list_of_urls[:MAXitems]
        print(len(list_of_urls))
        data = loop.run_until_complete(asyncio.gather(
            *(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if not header[1]:
                headers_without_html.add(header)
    return headers_without_html


list_of_urls = ['http://www.google.com', 'http://www.reddit.com']
headers_without_html = check_urls_without_html(list_of_urls)

for header in headers_without_html:
    print(header[0])

When I run it with too many URLs (e.g. 2000), it sometimes returns an error like this one:

data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 454, in run_until_complete
    self.run_forever()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 421, in run_forever
    self._run_once()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 1390, in _run_once
    event_list = self._selector.select(timeout)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\selectors.py", line 323, in select
    r, w, _ = self._select(self._readers, self._writers, [], timeout)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\selectors.py", line 314, in _select
    r, w, x = select.select(r, w, w, timeout)
ValueError: too many file descriptors in select()

I've read that this problem arises from a restriction in Windows. I've also read that there is not much that can be done about it, other than trying to use fewer file descriptors.

I've seen people push thousands of requests with asyncio and aiohttp, but even with my chunking I can't push 30-50 without getting this error.

Is there something fundamentally wrong with my code, or is this an inherent problem with Windows? Can it be fixed? Can the limit on the maximum number of file descriptors allowed in select() be increased?

Josep

3 Answers


By default, Windows can use only 64 sockets in an asyncio loop. This is a limitation of the underlying select() API call.

To increase the limit, use ProactorEventLoop with the code below. See the full docs here.

import sys
import asyncio

if sys.platform == 'win32':
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)

Another solution is to limit the overall concurrency using a semaphore, as in the answer provided here. For example, when making 2000 API calls you might not want too many parallel open requests (they might time out, and it becomes harder to see the individual call times). This gives you:

await gather_with_concurrency(100, *my_coroutines)
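The `gather_with_concurrency` helper referenced above is not defined in this answer; a minimal sketch might look like the following (the `demo` and `fetch` names are illustrative, not from the linked answer):

```python
import asyncio

async def gather_with_concurrency(n, *coros):
    # Allow at most n of the given coroutines to run at the same time.
    semaphore = asyncio.Semaphore(n)

    async def sem_coro(coro):
        async with semaphore:
            return await coro

    # gather preserves the order of the input coroutines.
    return await asyncio.gather(*(sem_coro(c) for c in coros))

async def demo():
    async def fetch(i):
        await asyncio.sleep(0)  # stand-in for a real HTTP request
        return i
    # At most 3 "requests" are in flight at any moment.
    return await gather_with_concurrency(3, *(fetch(i) for i in range(10)))

results = asyncio.run(demo())  # asyncio.run requires Python 3.7+
```

This caps concurrency regardless of how many coroutines you pass in, so you don't need to chunk the URL list by hand.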
Roelant
Andrew Svetlov
  • Thanks! A few questions: a) How is it possible that I'm reaching the limit with the chunking that I'm doing? b) Since I posted my original question I've been trying `from win32file import _setmaxstdio; _setmaxstdio(3072)`, and it seems to work pretty well. Does it do the same as ProactorEventLoop? – Josep Dec 07 '17 at 10:17
  • a) Yes. aiohttp has a limit of 100 concurrent connections by default, which is bigger than 64. b) I don't use Windows and cannot check `_setmaxstdio`, but the limit is a compile-time macro imho. It cannot be changed at runtime. – Andrew Svetlov Dec 07 '17 at 18:39
  • a) I don't see how that actually answers my question. I asked how it is possible to saturate the select if I'm chunking the data and this is the only script I'm running with aiohttp, unless aiohttp uses multiple connections per link. b) Well, I'm going to try your solution later, but `_setmaxstdio` does make a difference. Without it, it bugged out every 1000 links, give or take; with it, it has done 5000+ without problems, and the only reason it stopped is that I interrupted it. Edit: Anyway, I gave you the tick for answering my question, thanks. – Josep Dec 07 '17 at 21:59
  • You wrote that you process about 2000 URLs. An HTTP connection is returned to the internal pool but not released immediately. That's how you can run out of the limit for open sockets. – Andrew Svetlov Dec 08 '17 at 09:51
  • I've changed `loop = asyncio.get_event_loop()` to `loop = asyncio.ProactorEventLoop()` followed by `asyncio.set_event_loop(loop)`, and now it raises NotImplementedError in `\Python\Python36-32\lib\asyncio\events.py`, line 453, in `add_reader`, called from `\Python\Python36-32\lib\site-packages\aiodns\__init__.py`, line 85, in `_sock_state_cb`. – Josep Dec 09 '17 at 20:49
  • `add_reader` is not supported by `ProactorEventLoop`. It looks like `aiodns` isn't intended to work on Windows. Sorry, I'm not the library author. Maybe you should open an issue on the aiodns bug tracker? – Andrew Svetlov Dec 10 '17 at 08:01
  • Thanks for your help. I've opened an issue in the bug tracker of aiodns; meanwhile I've been working around it temporarily by uninstalling it and running without the async resolver. I've encountered another error that looks to me 100% within asyncio's scope. Should I open a ticket on asyncio to discuss it? – Josep Dec 10 '17 at 08:42
  • https://bugs.python.org/ is the proper place for asyncio issues. Please make sure that `Component` is `asyncio` for new bugs. – Andrew Svetlov Dec 10 '17 at 08:47

I'm having the same problem. Not 100% sure that this is guaranteed to work, but try replacing this:

session = aiohttp.ClientSession()

with this:

connector = aiohttp.TCPConnector(limit=60)
session = aiohttp.ClientSession(connector=connector)

By default, `limit` is set to 100 (docs), meaning that the client can have 100 simultaneous connections open at a time. As Andrew mentioned, Windows can only have 64 sockets open at a time, so we provide a number lower than 64 instead.
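Putting the pieces together, a rough sketch of how the connector limit might be wired into the asker's header check could look like this (the `head_is_html` and `check` names are illustrative, not from the original code):

```python
import asyncio
import aiohttp

async def head_is_html(session, url):
    # HEAD the URL and report whether its Content-Type mentions "html".
    try:
        async with session.head(url) as resp:
            return url, "html" in resp.headers.get("Content-Type", "")
    except Exception:  # connection errors, timeouts, malformed headers
        return url, False

async def check(urls):
    # Keep simultaneous connections below Windows' 64-socket select() limit.
    connector = aiohttp.TCPConnector(limit=60)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(head_is_html(session, u) for u in urls))

# Example usage (requires network access):
# results = asyncio.run(check(['http://www.google.com', 'http://www.reddit.com']))
```

With the connector enforcing the cap, there is no need for manual chunking or an extra semaphore.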

James Ko
# Add this where you set up the event loop, before any requests are made
import asyncio

loop = asyncio.ProactorEventLoop()
asyncio.set_event_loop(loop)