I'm trying to use the requests-futures
library to send a batch of asynchronous HTTP requests and identify the presence or absence of a specific bytestring in the content of each page.
Here is the synchronous version. Please note that the actual site I'm scraping is not Stack Overflow, and the real list contains around 20,000 URLs. In the example below, I'm averaging roughly 1 second of wall time per request, meaning the whole batch would take half a day at this rate.
import timeit
import requests
KEY = b'<meta name="referrer"'
def filter_url(url):
    """Presence or absence of `KEY` in page's content."""
    resp = requests.get(url, stream=True)
    return resp.content.find(KEY) > -1
urls = [
    'https://stackoverflow.com/q/952914/7954504',
    'https://stackoverflow.com/q/48512098/7954504',
    'https://stackoverflow.com/q/48511048/7954504',
    'https://stackoverflow.com/q/48509674/7954504',
    'https://stackoverflow.com/q/15666943/7954504',
    'https://stackoverflow.com/q/48501822/7954504',
    'https://stackoverflow.com/q/48452449/7954504',
    'https://stackoverflow.com/q/48452267/7954504',
    'https://stackoverflow.com/q/48405592/7954504',
    'https://stackoverflow.com/q/48393431/7954504'
]
start = timeit.default_timer()
res = [filter_url(url) for url in urls]
print(timeit.default_timer() - start)
# 11.748123944002145
Now, when I go to do this asynchronously:
from requests_futures.sessions import FuturesSession
session = FuturesSession()
def find_multi_reviews(urls):
    resp = [session.get(url).result() for url in urls]
    print(resp)
    return [i.content.find(KEY) > -1 for i in resp]
start = timeit.default_timer()
res2 = find_multi_reviews(urls)
print(timeit.default_timer() - start)
# 1.1806047540012514
I can get a 10x speedup. This is okay, but can I do better? As of now, I'm still looking at just under 2 hours of runtime. Are there tricks, such as increasing the number of workers or executing the requests in separate processes, that would lead to a further speed improvement here?
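To make the "more workers" part concrete, here is the kind of thing I had in mind. It's an untested sketch: find_multi_reviews_wide and workers=50 are just placeholders I made up, and I'm assuming FuturesSession accepts a max_workers argument to size its underlying thread pool. The idea is to submit every request before waiting on any result, then collect responses in completion order.

from concurrent.futures import as_completed

from requests_futures.sessions import FuturesSession

KEY = b'<meta name="referrer"'

def find_multi_reviews_wide(urls, workers=50):
    """Sketch: submit all requests up front, then harvest results as they finish."""
    # Assumption: max_workers controls how many requests can be in flight at once.
    session = FuturesSession(max_workers=workers)
    # Submit everything first so the pool stays saturated ...
    futures = {session.get(url): url for url in urls}
    results = {}
    # ... then collect responses in whatever order they complete.
    for future in as_completed(futures):
        url = futures[future]
        resp = future.result()
        results[url] = resp.content.find(KEY) > -1
    return results

Is widening the pool like this the right lever, or would moving the work into separate processes be the bigger win?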