
I have a coroutine that sends web requests and post-processes them. Currently I'm doing:

async def scrape(url, sess, logging=None):
    # request
    result = sess.get(url, headers=headers(url))
    # process
    if result.ok:
        await post_process(result.content)

async def main():
    # code here
    for url in urls:
        await asyncio.create_task(scrape(url, sess))

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

The problem is that it's running slowly! It seems the requests are blocking the event loop. How can I turn the requests into coroutines and wait for them to complete?

Note: I did google it, but I haven't found a concise example of how to do this. Also, I am using requests, and some searches suggest it's what's causing the trouble.

knh190

1 Answer


Fundamentally, requests is a synchronous library, so when you call

sess.get()

that call blocks the current thread (the one running the asyncio event loop) until it completes. Any other asynchronous tasks scheduled on that loop have to wait for it to finish before they can run, so you lose any advantage of using asyncio: requests is fully synchronous.

I think you wouldn't have asked the question if your code had been:

async def scrape(url, sess, logging=None):
    # call some extremely long-running and CPU-intensive function
    result = do_some_serious_work()
    # process
    if result.ok:
        await post_process(result.content)

because you'd have understood that your scrape() function occupies the whole thread while do_some_serious_work() runs, and that this is not what asyncio is designed for. But your code is in exactly the same situation; instead of being "CPU-bound" it is "IO-bound", because it has to wait for a network request to resolve before it allows the thread to continue.

So, as pointed out in the comments, you should use a network library that is built with asyncio in mind, such as aiohttp, which provides asynchronous methods that can yield to other async tasks while the network request is running in the background.
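
For example, here is a rough sketch of your scrape() rewritten with aiohttp (untested, and assuming headers(), post_process() and urls are the helpers and data from your question):

import asyncio
import aiohttp

async def scrape(url, sess):
    # sess.get() returns an async context manager; while the request is
    # in flight, the event loop is free to run other tasks
    async with sess.get(url, headers=headers(url)) as result:
        if result.status == 200:
            content = await result.read()
            await post_process(content)

async def main():
    # aiohttp uses its own session object instead of requests.Session
    async with aiohttp.ClientSession() as sess:
        # schedule all the scrapes at once so they overlap, rather than
        # awaiting each one before starting the next
        await asyncio.gather(*(scrape(url, sess) for url in urls))

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Note that asyncio.gather() is what actually makes the requests run concurrently; awaiting each task one at a time, as your main() does, would still process the URLs sequentially even with aiohttp.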

daphtdazz