Update: The problem was incomplete documentation; the event dispatcher passes keyword arguments to the hook function, so the hook has to accept them.
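In short, the hook just needs to swallow those extra keyword arguments. A minimal sketch of the fix applied to the print_url hook from my example below:

import grequests

urls = ['http://www.google.com/finance', 'http://finance.yahoo.com/', 'http://www.bloomberg.com/']

def print_url(r, **kwargs):
    # **kwargs absorbs the extra keyword arguments (e.g. 'verify', 'stream')
    # that the hook dispatcher passes along with the response object
    print r.url

reqs = [grequests.get(u, hooks=dict(response=print_url)) for u in urls]
print grequests.map(reqs)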
I have a list of about 30k URLs that I want to check for various strings. I have a working version of this script using Requests and BeautifulSoup, but it doesn't use threading or asynchronous requests, so it's incredibly slow.
Ultimately, what I would like to do is cache the HTML for each URL so I can run multiple checks without making redundant HTTP requests to each site. If I have a function that stores the HTML, what's the best way to send the HTTP GET requests asynchronously and then pass it the response objects?
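Roughly, this is the shape of what I'm after (just a sketch; html_cache and fetch_all are illustrative names, and I'm assuming grequests.map returns responses in the input order, with None for any request that failed):

import grequests

urls = ['http://www.google.com/finance', 'http://finance.yahoo.com/', 'http://www.bloomberg.com/']

def fetch_all(url_list):
    # build the unsent requests, then send them all concurrently
    reqs = [grequests.get(u) for u in url_list]
    return grequests.map(reqs)

# cache the HTML keyed by URL so later string checks can reuse it
html_cache = {}
for u, response in zip(urls, fetch_all(urls)):
    if response is not None:
        html_cache[u] = response.text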
I've been trying to use grequests (as described here) with the "hooks" parameter, but I'm getting errors, and the documentation doesn't go into much depth, so I'm hoping someone with more experience can shed some light.
Here's a simplified example of what I'm trying to accomplish:
import grequests

urls = ['http://www.google.com/finance', 'http://finance.yahoo.com/', 'http://www.bloomberg.com/']

def print_url(r):
    print r.url

def async(url_list):
    sites = []
    for u in url_list:
        rs = grequests.get(u, hooks=dict(response=print_url))
        sites.append(rs)
    return grequests.map(sites)

print async(urls)
And it produces the following TypeError:
TypeError: print_url() got an unexpected keyword argument 'verify'
<Greenlet at 0x32803d8L: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x00000000028D2160>>
(stream=False)> failed with TypeError
I'm not sure why it's sending 'verify' as a keyword argument by default. It would be great to get something working, though, so if anyone has any suggestions (using grequests or otherwise), please share :)
Thanks in advance.