I'm trying to use the requests-futures
library to send a batch of asynchronous HTTP requests and identify the presence or absence of a specific bytestring in the content of each page.
Here is the synchronous version. Please note that the actual site I'm scraping is not Stack Overflow, and the real list contains around 20,000 URLs. In the example below, I'm averaging roughly 1 second of wall time per request, meaning the whole batch would take half a day at this rate.
import timeit
import requests
KEY = b'<meta name="referrer"'
def filter_url(url):
    """Presence or absence of `KEY` in page's content."""
    resp = requests.get(url, stream=True)
    return resp.content.find(KEY) > -1
urls = [
    'https://stackoverflow.com/q/952914/7954504',
    'https://stackoverflow.com/q/48512098/7954504',
    'https://stackoverflow.com/q/48511048/7954504',
    'https://stackoverflow.com/q/48509674/7954504',
    'https://stackoverflow.com/q/15666943/7954504',
    'https://stackoverflow.com/q/48501822/7954504',
    'https://stackoverflow.com/q/48452449/7954504',
    'https://stackoverflow.com/q/48452267/7954504',
    'https://stackoverflow.com/q/48405592/7954504',
    'https://stackoverflow.com/q/48393431/7954504'
]
start = timeit.default_timer()
res = [filter_url(url) for url in urls]
print(timeit.default_timer() - start)
# 11.748123944002145
Now, when I go to do this asynchronously:
from requests_futures.sessions import FuturesSession
session = FuturesSession()
def find_multi_reviews(urls):
    resp = [session.get(url).result() for url in urls]
    print(resp)
    return [i.content.find(KEY) > -1 for i in resp]
start = timeit.default_timer()
res2 = find_multi_reviews(urls)
print(timeit.default_timer() - start)
# 1.1806047540012514
I can get a 10x speedup. This is okay, but can I do better? As of now, I'm still looking at just under 2 hours of runtime. Are there tricks, such as increasing the number of workers or executing the requests in separate processes, that would lead to a further speed improvement here?
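To make the "more workers" part concrete, here is the kind of thing I had in mind. It's an untested sketch: find_multi_reviews_wide and workers=50 are just placeholders I made up, and I'm assuming FuturesSession accepts a max_workers argument to size its underlying thread pool. The idea is to submit every request before waiting on any result, then collect responses in completion order.

from concurrent.futures import as_completed

from requests_futures.sessions import FuturesSession

KEY = b'<meta name="referrer"'

def find_multi_reviews_wide(urls, workers=50):
    """Sketch: submit all requests up front, then harvest results as they finish."""
    # Assumption: max_workers controls how many requests can be in flight at once.
    session = FuturesSession(max_workers=workers)
    # Submit everything first so the pool stays saturated ...
    futures = {session.get(url): url for url in urls}
    results = {}
    # ... then collect responses in whatever order they complete.
    for future in as_completed(futures):
        url = futures[future]
        resp = future.result()
        results[url] = resp.content.find(KEY) > -1
    return results

Is widening the pool like this the right lever, or would moving the work into separate processes be the bigger win?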