
Is there a way to set a new proxy IP (e.g. from a pool) according to the HTTP response status code? For example, start with an IP from an IP list until it gets a 503 response (or another HTTP error code), then use the next one until it gets blocked, and so on, something like:

if http_status_code in [403, 503, ..., n]:
    proxy_ip = 'new ip'
    # Then keep using it till it's gets another error code

Any ideas?

XO39

1 Answer

Scrapy has a downloader middleware, enabled by default, for handling proxies. It's called `HttpProxyMiddleware`, and what it does is allow you to supply the meta key `proxy` on your Request and use that proxy for the request.

There are a few ways of doing this.
The first, straightforward one is simply to use it in your spider code:

def parse(self, response):
    if response.status in range(400, 600):
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)  # skip the dupe filter, since this url was already requested once

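Since the question asks about rotating through a pool, the same idea extends to cycling over a list of proxies instead of a single hard-coded one. A minimal sketch, where the `PROXIES` list and the `next_proxy` helper are assumptions for illustration, not part of Scrapy:

```python
from itertools import cycle

# hypothetical proxy pool -- replace with your own list
PROXIES = ['http://proxy1:8010', 'http://proxy2:8010', 'http://proxy3:8010']

# cycle() turns the pool into an endless iterator, so rotation
# wraps back to the first proxy after the last one gets blocked
proxy_pool = cycle(PROXIES)

def next_proxy():
    """Return the next proxy url to put into Request meta['proxy']."""
    return next(proxy_pool)

# inside parse() this plugs in as:
#   if response.status in (403, 503):
#       return Request(response.url,
#                      meta={'proxy': next_proxy()},
#                      dont_filter=True)
```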
Another, more elegant way would be to use a custom downloader middleware, which would handle this for multiple callbacks and keep your spider code cleaner:

import logging

from scrapy import Request

from project.settings import PROXY_URL


class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logging.debug('retrying [{}]{} with proxy: {}'
                          .format(response.status, response.url, PROXY_URL))
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response

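For the middleware above to run, it has to be registered in the project settings. A sketch, assuming the class lives in `project/middlewares.py` (the module path and the priority number are assumptions you'd adjust to your project):

```python
# settings.py -- the priority 543 slots the custom middleware
# before the stock HttpProxyMiddleware (which runs at 750)
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyDM': 543,
}
PROXY_URL = 'http://myproxy:8010'  # imported by MyDM from project.settings
```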
Note that by default Scrapy doesn't let through any response codes other than 200. Redirect codes (300s) are handled automatically by the redirect middleware, and error codes (400s and 500s) raise request errors via the HttpError middleware. To handle responses other than 200 you need to either:

Specify that in Request Meta:

Request(url, meta={'handle_httpstatus_list': [404, 505]})
# or for all
Request(url, meta={'handle_httpstatus_all': True})

Or set project- or spider-wide settings:

HTTPERROR_ALLOW_ALL = True  # for all
HTTPERROR_ALLOWED_CODES = [404, 505]  # for specific

as per http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes

Granitosaurus
    It does not catch any HTTP status code except `200`! – XO39 Aug 27 '16 at 10:30
  • @XO39 Correct, by default scrapy does not allow any responses other than `200` through. See my edit for more details :) – Granitosaurus Aug 27 '16 at 13:13
  • But what if I don't want to handle them, only to detect them? I mean, all I want is to change the IP when one of these error codes is met! Can I do that without handling them this way? The documentation says: _Keep in mind, however, that it’s usually a **bad idea** to handle **non-200** responses, unless you really know what you’re doing._ – XO39 Aug 27 '16 at 18:24
  • Hmm, I guess you can override `HttpProxyMiddleware` and use it instead of the default one. In it you could define rules to change proxies based on response codes, but you would **still** have to modify the response/request, since you want to retry the failed requests with a new proxy, and because of the asynchronous nature of Scrapy a bunch of requests with the old proxy would remain in the queue and would need to be modified too, so there's no way of getting around this modification issue. – Granitosaurus Aug 27 '16 at 21:00
  • @XO39 However one idea would be to rearrange the order of the default middlewares to have your proxy middleware before http error middleware so you wouldn't need to handle the response codes. I'm not sure how safe that is but you can always try!:) – Granitosaurus Aug 27 '16 at 21:02
  • All the URLs of the pages to be crawled are stored in a database, and they will be extracted one by one, which solves the asynchronous issue. The only issue I'm still having is detecting the HTTP response codes, in order to switch to another proxy when the first one gets blocked. – XO39 Aug 27 '16 at 21:10
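The idea floated in the comment thread, a middleware that walks a proxy pool whenever a blocked status comes back, can be sketched roughly like this. The class name, the pool contents, and the status list are all assumptions for illustration; a real version would read them from settings:

```python
from itertools import cycle

# hypothetical values -- adjust to your project
PROXY_POOL = ['http://proxy1:8010', 'http://proxy2:8010']
RETRY_STATUSES = {403, 503}


class ProxyPoolMiddleware(object):
    """Downloader middleware that stamps every outgoing request with the
    current proxy and rotates to the next one on a blocked status code."""

    def __init__(self):
        self.proxies = cycle(PROXY_POOL)
        self.current = next(self.proxies)

    def process_request(self, request, spider):
        # every outgoing request uses whatever proxy is current
        request.meta['proxy'] = self.current

    def process_response(self, request, response, spider):
        if response.status in RETRY_STATUSES:
            # rotate the pool, then retry the same url with the new proxy
            self.current = next(self.proxies)
            retry = request.copy()
            retry.meta['proxy'] = self.current
            retry.dont_filter = True
            return retry
        return response
```

Requests still in the queue keep their old proxy until they come back through `process_request`, which is the modification issue mentioned above; fetching URLs one by one from the database, as described, sidesteps it.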