
I've written a script in Python using asyncio together with the aiohttp library to asynchronously parse the names out of the pop-up boxes that open when clicking the contact info buttons of the different agencies listed in a table on this website. The webpage displays the tabular content across 513 pages.

I encountered the error `too many file descriptors in select()` when I tried with `asyncio.get_event_loop()`, but when I came across this thread I saw a suggestion to use `asyncio.ProactorEventLoop()` to avoid such an error, so I switched to the latter. However, even after complying with the suggestion, the script collects the names from only a few pages before it throws the following error. How can I fix this?

raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.tursab.org.tr:443 ssl:None [The semaphore timeout period has expired]

This is my attempt so far:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def get_links(url):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await process_docs(text)
            return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
                sauce = BeautifulSoup(text,"lxml")
                try:
                    name = sauce.select_one("p > b").text
                except Exception: name = ""
                print(name)

if __name__ == '__main__':
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))

In short, what the process_docs() function does is collect the data-id numbers from each page and reuse them as the AID parameter of the https://www.tursab.org.tr/en/displayAcenta?AID={} link in order to collect the names from the pop-up boxes. One such id is 8757, and one such qualified link is therefore https://www.tursab.org.tr/en/displayAcenta?AID=8757.
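
For clarity, the same two-step flow for a single page looks roughly like this when done synchronously (an illustration only, reusing the selectors from the script above):

import requests
from bs4 import BeautifulSoup

# Illustration only: fetch one listing page, then one detail pop-up per row.
page_url = "https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa=1"
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

soup = BeautifulSoup(requests.get(page_url).text, "lxml")
for row in soup.select("#acentaTbl tr[data-id]"):
    detail_url = lead_link.format(row.get("data-id"))  # e.g. ...displayAcenta?AID=8757
    sauce = BeautifulSoup(requests.get(detail_url).text, "lxml")
    name = sauce.select_one("p > b")
    print(name.text if name else "")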

Btw, if I change the highest number used in the links variable to 20 or 30 or so, it goes smoothly.

  • Are you sure that that is the correct way to use `asyncio.Semaphore`? Your code creates a `Semaphore` with value 10 and only acquires it once, so it basically does nothing. You probably want to create the `Semaphore` *outside* those functions and pass it around to all the `get_links` calls instead... – Bakuriu Dec 12 '18 at 16:51
  • Thanks for your comment @Bakuriu. I found that idea (using a semaphore in such a way) in [this post](https://stackoverflow.com/questions/53718961/script-performs-very-slowly-even-when-it-runs-asynchronously). – SIM Dec 12 '18 at 16:55
  • Check the code you linked in your last comment carefully! You are **not** doing the same thing – Bakuriu Dec 12 '18 at 16:56

1 Answer

async def get_links(url):
    async with asyncio.Semaphore(10):

You can't do it this way: a new semaphore instance is created on each function call, whereas you need a single semaphore instance shared by all requests. Change your code like this:

sem = asyncio.Semaphore(10)  # module level

async def get_links(url):
    async with sem:
        # ...


async def fetch_again(link):
    async with sem:
        # ...

You can also return to the default event loop once the semaphore is used correctly:

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(...)

Finally, you should alter both get_links(url) and fetch_again(link) so that the parsing happens outside the semaphore block, releasing it as soon as possible, before the semaphore is needed again inside process_docs(text).

Final code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

sem = asyncio.Semaphore(10)  # a single semaphore instance shared by all requests

async def get_links(url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
    # the session is closed and the semaphore released before parsing starts
    result = await process_docs(text)
    return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
    sauce = BeautifulSoup(text,"lxml")
    try:
        name = sauce.select_one("p > b").text
    except Exception:
        name = "o"
    print(name)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))
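
If descriptor-related errors still show up, another option worth trying (a rough sketch, not verified against this site) is to reuse one ClientSession with a bounded connector instead of opening a new session per request, which also makes the explicit semaphore unnecessary:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def fetch_name(session, link):
    async with session.get(link) as response:
        text = await response.text()
    tag = BeautifulSoup(text, "lxml").select_one("p > b")
    print(tag.text if tag else "")

async def get_links(session, url):
    async with session.get(url) as response:
        text = await response.text()
    soup = BeautifulSoup(text, "lxml")
    items = [tr.get("data-id") for tr in soup.select("#acentaTbl tr[data-id]")]
    await asyncio.gather(*(fetch_name(session, lead_link.format(item)) for item in items))

async def main():
    connector = aiohttp.TCPConnector(limit=10)  # at most 10 open connections overall
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*(get_links(session, link) for link in links))

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())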
  • Sorry for not complying with your guideline about using the semaphore in my script, @Mikhail Gerasimov. I ran your script and encountered this vicious error `too many file descriptors in select()`. Moreover, it runs very, very slowly. Plus one for the correct usage of the semaphore. – SIM Dec 12 '18 at 17:15
  • @asmitu "it runs ver very slowly" - do you mean the exact code from this answer? It's not slow, it just doesn't always find selectors. To see it add prints after both `await response.text()`. If you still think it's slow, you can increase value you init semaphore with, for example - `asyncio.Semaphore(20)`. I can't reproduce `too many file descriptors in select()`, can you specify time script works before this error occurs in your case? – Mikhail Gerasimov Dec 12 '18 at 17:38
  • Thanks for your solution, @Mikhail Gerasimov. Should I stick to the way you indented your script? I'm asking because the indentation in my script above was intentional, as I tried to follow [the last example given in this blog](http://edmundmartin.com/concurrent-crawling-in-python/). I'd be very happy to hear your thoughts on this. Btw, the error `too many file descriptors in select()` still persists. FYI, I'm on 32-bit Windows. – SIM Dec 12 '18 at 19:15
  • @asmitu I didn't pay much attention to indentation; you can do it as you see fit. It's usually a good idea to follow [PEP 8](https://docs.python-guide.org/writing/style/). "this error ... still persists" - unfortunately I can't reproduce it, nor can I see why it should appear. The only thought I have so far is that some previous run of an older script version acquired most of the descriptors. Did you try restarting the computer before running the final script version? – Mikhail Gerasimov Dec 14 '18 at 07:14