
I have X initial URLs that are paginated: to get the next set of data, I have to grab the next URL from the response's Link header until there is no next URL. I'm having trouble getting this working, so I'm trying a queue-based approach that I found here.
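For context, the plain sequential loop I'm trying to parallelize looks roughly like this; it's a minimal sketch, and the token and base URL are placeholders just like in the real code below:

import asyncio
from aiohttp import ClientSession

async def fetch_all_pages(session, first_url):
    headers = {'Authorization': 'Bearer KEY'}
    pages = []
    url = first_url
    while url is not None:
        async with session.get(url, headers=headers) as response:
            pages.append(await response.json())
            # response.links parses the Link response header;
            # no "next" entry means this was the last page.
            next_link = response.links.get("next")
            url = str(next_link["url"]) if next_link else None
    return pages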

import asyncio
from aiohttp import ClientSession, TCPConnector

async def get(session, url):
    headers = {
        'Authorization': 'Bearer KEY',
    }
    async with session.get(url, headers=headers) as response:
        # Parse the body while the connection is still open.
        data = await response.json()
        return data, response

async def process(session, url, q):
    try:
        try:
            views, response = await get(session, url)
            if response.status == 404:
                return
        except Exception as e:
            print(e)
            return
        try:
            # A KeyError here just means there is no "next" link,
            # i.e. this was the last page.
            await q.put(str(response.links["next"]["url"]))
        except KeyError:
            pass

        # <do something with views>
    except Exception as e:
        print(e)

async def fetch_worker(session, q):
    while True:
        url = await q.get()
        try:
            await process(session, url, q)
        except Exception as e:
            print(e)
        finally:
            q.task_done()

async def d():
    # <code to query and put data into stdrows>
    url_queue = asyncio.Queue()
    tasks = []
    connector = TCPConnector(limit=500)
    async with ClientSession(connector=connector) as session:
        url = '<some base url>'

        for i in range(500):
            tasks.append(asyncio.create_task(fetch_worker(session, url_queue)))

        for row in stdrows:
            await url_queue.put(url.format(row[1]))

        # The workers loop forever, so gathering them first would never
        # return; wait for the queue to drain, then cancel the workers.
        await url_queue.join()
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(d())

This appears not to be running at 500 tasks/sec. Is it even possible to reach that rate without knowing all the URLs ahead of time? I am hoping to fetch the next URL from whatever initial URL (or from its paginated URL) while I work with views.
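One thing I'm considering is handing the views processing off to its own task, so each worker can go straight back to the queue as soon as the next page is enqueued. A rough sketch of a modified `process` (it reuses `get` from above; `handle_views` is a hypothetical stand-in for my `<do something with views>` code):

async def process(session, url, q):
    views, response = await get(session, url)
    if response.status == 404:
        return
    # Enqueue the next page first so another worker can start on it
    # while this page's views are still being processed.
    next_link = response.links.get("next")
    if next_link:
        await q.put(str(next_link["url"]))
    # handle_views is hypothetical; create_task lets it run
    # concurrently instead of blocking this worker.
    asyncio.create_task(handle_views(views))

In real code I'd probably keep references to these tasks, or push views onto a second queue with its own workers, so nothing gets garbage-collected mid-flight.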

  • As an aside, using `except Exception` like that is probably a bad idea, see https://stackoverflow.com/questions/54948548/what-is-wrong-with-using-a-bare-except. – AMC Apr 14 '20 at 01:01
  • Your code seems correct, except for `except Exception: pass` where you should at least print the exception. The question is, why do you expect precisely 500 tasks per sec? You started 500 workers, ok, but how do you know that the server and the network take exactly 1 request per second from each worker? Maybe you have overworked the server, or maybe it's throttling you? – user4815162342 Apr 14 '20 at 07:53
  • @user4815162342 I am going by how the endpoint is telling me how much is left in the "bucket" in the response header. It starts out at 700 and instantly drops to 699.8... and stays around there for the rest of the run. This matches up with when I print the URL in `process` - it prints the initial, say, 24, then it slows down. Each of the initial URLs can have >= 0 paginated URLs, but there are definitely more than 500 generated URLs. If I put 2000 connections/tasks, it's still the same time. (See the sketch after these comments for how I'm reading that header.) – Nicholas Tyler Apr 14 '20 at 12:38
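For reference, this is roughly how I'm reading the bucket and timing each request; `X-RateLimit-Remaining` is a placeholder for whatever header the endpoint actually sends:

import time

async def get(session, url):
    headers = {'Authorization': 'Bearer KEY'}
    start = time.monotonic()
    async with session.get(url, headers=headers) as response:
        data = await response.json()
        # Placeholder header name: the endpoint reports how much of the
        # rate-limit "bucket" is left on every response.
        bucket = response.headers.get("X-RateLimit-Remaining")
        print(f"{url} took {time.monotonic() - start:.2f}s, bucket={bucket}")
        return data, response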

0 Answers