
I have a list of URLs of websites that I want to download repeatedly (at variable time intervals) using Python. The downloads need to run asynchronously to cope with a large number of websites and/or long response times.
I've tried many things with event loops, queues, async functions, asyncio, etc., but I can't get it to work. The following very simple version downloads the websites repeatedly, but it does not download them concurrently - instead, the next download only starts after the previous one has finished.

import asyncio
import datetime
import aiohttp

def produce_helper(url: str):
    # helper, because loop.call_later cannot schedule an async function directly
    loop.create_task(produce(url))

async def produce(url: str):
    await q.put(url)
    print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Produced {url}')

async def consume():
    async with aiohttp.ClientSession() as session:
        while True:
            url = await q.get()
            print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Start: {url}')
            async with session.get(url, timeout=10) as response:
                print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Finished: {url}')
                q.task_done()
                loop.call_later(10, produce_helper, url)

q = asyncio.Queue()
url_list = ["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]

loop = asyncio.get_event_loop()
for url in url_list:
    loop.create_task(produce(url))
loop.create_task(consume())
loop.run_forever()

Is this a suitable approach for my problem? Is there anything better conceptually?
And how do I accomplish concurrent downloads?
Any help is appreciated.

EDIT:
The challenge (as described in the comments below) is the following: after each successful download, I want to add the respective URL back to the queue - to become due after a specified waiting time (10 s in the example in my question). As soon as it is due, I want to download the website again, add the URL back to the queue, and so on.
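
To illustrate the intended behaviour, here is a rough sketch of the per-URL cycle I have in mind (the name fetch_forever and the fixed 10 s delay are just placeholders for this sketch); whether structuring it per URL like this or via a shared queue is better is part of what I'm unsure about.

import asyncio
import datetime
import aiohttp

async def fetch_forever(session: aiohttp.ClientSession, url: str, delay: float = 10):
    # download the URL, wait `delay` seconds, and repeat indefinitely
    while True:
        print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Start: {url}')
        async with session.get(url, timeout=10) as response:
            await response.read()  # actually fetch the body
        print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Finished: {url}')
        await asyncio.sleep(delay)  # waiting here does not block the other tasks

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # one task per URL; gather keeps them all running (this never returns)
        await asyncio.gather(*(fetch_forever(session, url) for url in urls))

url_list = ["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]
asyncio.get_event_loop().run_until_complete(main(url_list))

(There is no error handling in this sketch; a single failed request would stop the whole gather.)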

  • Does this help? https://stackoverflow.com/questions/35926917/asyncio-web-scraping-101-fetching-multiple-urls-with-aiohttp – Michael H. Feb 13 '18 at 19:53
  • Thanks. Unfortunately, not really. I've done something like that before: fetch all URLs at once and that's it. However, my challenge now is the following: after each successful download, I want to add the respective URL back to the queue - to become due after a specified waiting time (10 s in the example in my question). As soon as it is due, I want to download the website again, add the URL back to the queue, etc. – Cake Feb 13 '18 at 22:43
  • You only have one `consume` task running. If you want to allow multiple requests to run at a time, you need to use more than one. – dirn Feb 13 '18 at 22:55
  • @dirn: Thank you. This works. Do you have an opinion about the way I re-add the task to download the website using `loop.call_later(10, produce_helper, url)`? I guess there is a more elegant way!? – Cake Feb 13 '18 at 23:07
  • 1
    You can use `loop.call_later(10, q.put_nowait, url)` and completely remove the `produce_helper`. Also, since your queue is unbounded (and therefore non-blocking), you don't need `produce` as a separate coroutine at all, just use `q.put_nowait` when you need to put something into the queue. – user4815162342 Feb 14 '18 at 10:11
  • 1
    Also, you don't need the call to `task_done()`. The only purpose of that call is for queues which you plan to `join()`, and that is never the case with your queue which is designed to never get emptied. – user4815162342 Feb 14 '18 at 10:12
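
Pulling the comments together, a sketch of the adjusted version might look like this: several consume() tasks share the queue (the worker count of 3 below is an arbitrary choice), re-scheduling uses loop.call_later(10, q.put_nowait, url), and produce(), produce_helper() and task_done() are dropped.

import asyncio
import datetime
import aiohttp

async def consume():
    async with aiohttp.ClientSession() as session:
        while True:
            url = await q.get()
            print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Start: {url}')
            async with session.get(url, timeout=10) as response:
                await response.read()  # actually fetch the body
            print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Finished: {url}')
            # put the URL back 10 s from now; put_nowait is fine on an unbounded queue
            loop.call_later(10, q.put_nowait, url)

q = asyncio.Queue()
url_list = ["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]

loop = asyncio.get_event_loop()
for url in url_list:
    q.put_nowait(url)
for _ in range(3):  # more than one consumer is what makes the downloads concurrent
    loop.create_task(consume())
loop.run_forever()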
