I have Python 3.4 and installed requests and a few other packages needed to web scrape. My problem is that I'd like to scrape about 7000 pages (just html/text) but don't want to do it all at once; I'd like some kind of delay so I don't hit the servers with too many requests and potentially get banned. I've heard of grequests, but apparently it isn't available for Python 3.4 (the actual error says it can't find vcvarsall.bat, but in the documentation I didn't see any support for 3.4). Does anyone know of an alternative program that could manage the URL requests? In other words, I'm not looking to grab everything as fast as possible, but rather slow and steady.
- possible duplicate http://stackoverflow.com/questions/21978115/using-grequests-to-make-several-thousand-get-requests-to-sourceforge-get-max-r – remudada Aug 14 '14 at 05:45
- @remudada Thanks, yes, I saw that and if I could install grequests, I could solve my problem, but I can't get it to install on python 3.4, so I was looking for alternative solutions? – thatandrey Aug 14 '14 at 05:53
- Your question is off-topic, but I can give you an answer: Scrapy http://scrapy.org/ – Maxime Lorant Aug 14 '14 at 07:35
1 Answer
I suggest rolling your own multithreaded program to do requests. I found `concurrent.futures` to be the easiest way to multithread these kinds of requests, in particular using the `ThreadPoolExecutor`. They even have a simple multithreaded URL request example in the documentation.
As for the second part of the question, it really depends on how much/how you want to limit your requests. For me, setting a sufficiently low `max_workers` argument and possibly including a `time.sleep` wait in my function was enough to avoid any problems even when scraping tens of thousands of pages, but this obviously depends a lot more on the site you're trying to scrape. It shouldn't be hard to implement some kind of batching or waiting, though.
The following code is untested, but hopefully it can be a starting point. From here, you probably want to modify `get_url_data` (or whatever function you're using) with whatever else you need to do (e.g. parsing, saving).
import concurrent.futures as futures
import requests
from requests.exceptions import HTTPError

urllist = ...  # your list of ~7000 urls to fetch

def get_url_data(url, session):
    # fetch one page; return its text, or None on an HTTP error
    try:
        r = session.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:
        return None
    return r.text

s = requests.Session()
try:
    # max_workers limits how many requests run at the same time
    with futures.ThreadPoolExecutor(max_workers=5) as ex:
        future_to_url = {ex.submit(get_url_data, url, s): url
                         for url in urllist}
        # map each url to the downloaded text (or None if it failed)
        results = {future_to_url[future]: future.result()
                   for future in futures.as_completed(future_to_url)}
finally:
    s.close()
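For example, if you wanted to put the delay and the parsing directly into the function, a version of `get_url_data` might look like the following. This is just an untested sketch: the one-second pause and the use of `bs4` (BeautifulSoup) are my own assumptions, so adjust them to whatever your site and parsing needs actually require.

import time

import bs4  # assumes BeautifulSoup 4 is installed
from requests.exceptions import HTTPError

def get_url_data(url, session):
    # fetch one page, pause briefly, and return the parsed soup (or None on error)
    try:
        r = session.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:
        return None
    time.sleep(1)  # hypothetical 1-second pause so each worker goes easy on the server
    return bs4.BeautifulSoup(r.text)  # parse here instead of returning raw html

In that case `results` would map each url to a parsed soup object rather than a string.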

- Thanks, that looks promising, my programming is terribly slow as I'm still a beginner, but I'll try this and get back to you by tomorrow! Does this mean I'll be using urllib instead of requests? But I can still use BeautifulSoup to do the parsing, right? – thatandrey Aug 14 '14 at 06:21
- No, you should still almost certainly use `requests`. I'll see if I can find some old code I used and edit it into the answer. – Roger Fan Aug 14 '14 at 06:28
- Thanks, that seems like everything! I'm still trying to figure out how to extract the data since I'm also new to CSS references. Is this putting all the html files into "results"? For every item in results, would I be able to run a for loop and do something like test = bs4.BeautifulSoup(results[0].text)? – thatandrey Aug 14 '14 at 06:47
- Hopefully! Or, depending on the number of sites and how you want to handle them, you could add any parsing and saving into your version of the `get_url_data` function, in which case you might not need to return a `results` dict at all. – Roger Fan Aug 14 '14 at 06:52
- Thanks, I'll probably do that. Just as a final question, you didn't seem to use `time.sleep` - if I wanted to, could I just add `sleep(1)` to the end of the `get_url_data` function? – thatandrey Aug 14 '14 at 06:55
- Yes. Though I'm pretty sure `time.sleep` releases the GIL, so it will allow other threads to run during that time. Alternatively, if you wanted to implement periodic pauses you could divide the urls into batches and do them one batch at a time with pauses in between. – Roger Fan Aug 14 '14 at 06:59
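For reference, the batching approach described in the last comment could look roughly like the sketch below. It is untested, the `batch_size` and `pause` values are made-up numbers, and it assumes the `get_url_data` function and session from the answer above.

import time
import concurrent.futures as futures

def scrape_in_batches(urls, session, batch_size=100, pause=30):
    # scrape urls one batch at a time, sleeping between batches
    results = {}
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        with futures.ThreadPoolExecutor(max_workers=5) as ex:
            future_to_url = {ex.submit(get_url_data, url, session): url
                             for url in batch}
            for future in futures.as_completed(future_to_url):
                results[future_to_url[future]] = future.result()
        time.sleep(pause)  # pause before starting the next batch
    return results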