I have Python 3.4 and installed requests and a few other packages needed to web scrape. My problem is that I'd like to scrape about 7000 pages (just html/text) but don't want to do it all at once; I'd like some kind of delay so I don't hit the servers with too many requests and potentially get banned. I've heard of grequests, but apparently it isn't available for Python 3.4 (the actual error says it can't find vcvarsall.bat, but in the documentation I didn't see any support for 3.4). Does anyone know of an alternative program that could manage the URL requests? In other words, I'm not looking to grab everything as fast as possible, but rather slow and steady.
- possible duplicate http://stackoverflow.com/questions/21978115/using-grequests-to-make-several-thousand-get-requests-to-sourceforge-get-max-r – remudada Aug 14 '14 at 05:45
- @remudada Thanks, yes, I saw that and if I could install grequests, I could solve my problem, but I can't get it to install on python 3.4, so I was looking for alternative solutions? – thatandrey Aug 14 '14 at 05:53
- Your question is off-topic, but I can give you an answer: Scrapy http://scrapy.org/ – Maxime Lorant Aug 14 '14 at 07:35
1 Answer
I suggest rolling your own multithreaded program to do requests. I found `concurrent.futures` to be the easiest way to multithread these kinds of requests, in particular using the `ThreadPoolExecutor`. They even have a simple multithreaded URL request example in the documentation.
As for the second part of the question, it really depends on how much/how you want to limit your requests. For me, setting a sufficiently low `max_workers` argument and possibly including a `time.sleep` wait in my function was enough to avoid any problems even when scraping tens of thousands of pages, but this obviously depends a lot more on the site you're trying to scrape. It shouldn't be hard to implement some kind of batching or waiting, though.
The following code is untested, but hopefully it can be a starting point. From here, you probably want to modify `get_url_data` (or whatever function you're using) with whatever else you need to do (e.g. parsing, saving).
import concurrent.futures as futures
import requests
from requests.exceptions import HTTPError

urllist = ...  # your list of ~7000 urls to fetch

def get_url_data(url, session):
    # fetch one page; return its text, or None on an HTTP error
    try:
        r = session.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:
        return None
    return r.text

s = requests.Session()
try:
    # max_workers limits how many requests run at the same time
    with futures.ThreadPoolExecutor(max_workers=5) as ex:
        future_to_url = {ex.submit(get_url_data, url, s): url
                         for url in urllist}
        # map each url to the downloaded text (or None if it failed)
        results = {future_to_url[future]: future.result()
                   for future in futures.as_completed(future_to_url)}
finally:
    s.close()
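For example, if you wanted to put the delay and the parsing directly into the function, a version of `get_url_data` might look like the following. This is just an untested sketch: the one-second pause and the use of `bs4` (BeautifulSoup) are my own assumptions, so adjust them to whatever your site and parsing needs actually require.

import time

import bs4  # assumes BeautifulSoup 4 is installed
from requests.exceptions import HTTPError

def get_url_data(url, session):
    # fetch one page, pause briefly, and return the parsed soup (or None on error)
    try:
        r = session.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:
        return None
    time.sleep(1)  # hypothetical 1-second pause so each worker goes easy on the server
    return bs4.BeautifulSoup(r.text)  # parse here instead of returning raw html

In that case `results` would map each url to a parsed soup object rather than a string.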

- Thanks, that looks promising, my programming is terribly slow as I'm still a beginner, but I'll try this and get back to you by tomorrow! Does this mean I'll be using urllib instead of requests? But I can still use BeautifulSoup to do the parsing, right? – thatandrey Aug 14 '14 at 06:21
- No, you should still almost certainly use `requests`. I'll see if I can find some old code I used and edit it into the answer. – Roger Fan Aug 14 '14 at 06:28
- Thanks, that seems like everything! I'm still trying to figure out how to extract the data since I'm also new to CSS references. Is this putting all the html files into "results"? For every item in results, would I be able to run a for loop and do something like test = bs4.BeautifulSoup(results[0].text)? – thatandrey Aug 14 '14 at 06:47
- Hopefully! Or, depending on the number of sites and how you want to handle them, you could add any parsing and saving into your version of the `get_url_data` function, in which case you might not need to return a `results` dict at all. – Roger Fan Aug 14 '14 at 06:52
- Thanks, I'll probably do that. Just as a final question, you didn't seem to use `time.sleep` - if I wanted to, could I just add `sleep(1)` to the end of the `get_url_data` function? – thatandrey Aug 14 '14 at 06:55
- Yes. Though I'm pretty sure `time.sleep` releases the GIL, so it will allow other threads to run during that time. Alternatively, if you wanted to implement periodic pauses you could divide the urls into batches and do them one batch at a time with pauses in between. – Roger Fan Aug 14 '14 at 06:59
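For reference, the batching approach described in the last comment could look roughly like the sketch below. It is untested, the `batch_size` and `pause` values are made-up numbers, and it assumes the `get_url_data` function and session from the answer above.

import time
import concurrent.futures as futures

def scrape_in_batches(urls, session, batch_size=100, pause=30):
    # scrape urls one batch at a time, sleeping between batches
    results = {}
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        with futures.ThreadPoolExecutor(max_workers=5) as ex:
            future_to_url = {ex.submit(get_url_data, url, session): url
                             for url in batch}
            for future in futures.as_completed(future_to_url):
                results[future_to_url[future]] = future.result()
        time.sleep(pause)  # pause before starting the next batch
    return results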