
I'm trying to scrape a website for broken links. So far I have this code, which successfully logs in and crawls the site, but it only records HTTP 200 status codes:

import requests
import scrapy
from lxml import html
from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor

# HttpResponseItem is the project's own Item subclass, defined in the
# project's items module


class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value

        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        """Record the response we received, then follow its links to keep crawling"""
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r

The docs and these answers lead me to believe that `handle_httpstatus_all = True` should cause scrapy to pass error responses to my parse method, but so far I haven't been able to capture any.

I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.

What do I need to change to capture the HTTP error codes scrapy is encountering?

chrisbunney
  • Please remove the `allowed_domains` attribute; it isn't needed, and it could also be filtering your requests. Maybe that's the problem – eLRuLL Dec 17 '18 at 19:13
  • I removed the `allowed_domains = ['localhost']` with no change in behaviour – chrisbunney Dec 17 '18 at 19:21
  • I put the `allowed_domains = ['localhost']` back in, after the spider ended up finding its way onto tripadvisor: `2018-12-17 19:29:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186364-c31-zfp5-Sheffield_South_Yorkshire_England.html>` – chrisbunney Dec 17 '18 at 19:30
  • ok, so now we are facing another problem? Please check my answer – eLRuLL Dec 17 '18 at 19:32

2 Answers


`handle_httpstatus_list` can be defined at the spider level, but `handle_httpstatus_all` can only be defined at the Request level, by including it in the request's `meta` argument.
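
For example, a minimal sketch of the question's link-following loop with that change (same names as the question's spider; the initial FormRequest would need the same meta entry if the login response itself should be captured regardless of status):

        for link in self.link_extractor.extract_links(response):
            # handle_httpstatus_all is read from Request.meta, not from a
            # spider attribute, so every response reaches the parse callback
            yield Request(link.url, self.parse,
                          meta={'handle_httpstatus_all': True,
                                'link_text': link.text})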

I would still recommend using an errback for these cases, but if everything is under control, it shouldn't create new problems.
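
A rough sketch of that errback route, reusing the question's spider (`handle_error` is a hypothetical method name; `HttpError` is the exception HttpErrorMiddleware raises for responses it would otherwise filter out):

from scrapy.spidermiddlewares.httperror import HttpError

class HttpStatusSpider(scrapy.Spider):
    # ... attributes, start_requests and parse as in the question ...

    def handle_error(self, failure):
        # Called for failed requests; non-2xx responses arrive here wrapped
        # in an HttpError raised by HttpErrorMiddleware
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.info('Broken link: %s returned %d',
                             response.url, response.status)

    # each request is then built with the errback attached, e.g.
    #   Request(link.url, callback=self.parse, errback=self.handle_error)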

eLRuLL
  • Ah, very interesting. That's an easily overlooked difference, and I can now see 4xx codes being captured. Not sure that the 5xxs are getting captured though. Next step is to try an `errback` – chrisbunney Dec 17 '18 at 19:34
  • Glad I helped you get the http requests you needed. – eLRuLL Dec 17 '18 at 19:36

So, I don't know if this is the proper scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).

I disabled the HttpErrorMiddleware by adding this snippet to my scrapy project's settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}
chrisbunney
  • Sure, I would say it is a good solution, but of course just for your project. I don't think it could be recommended in a project with a lot of spiders, where we only need to disable this for some spiders or even some requests. – eLRuLL Dec 20 '18 at 17:49
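
For that per-spider case, one option (a sketch, assuming nothing beyond stock Scrapy) is to move the same override into the spider's `custom_settings`, so that only this spider runs with HttpErrorMiddleware disabled:

class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'

    # Applies to this spider only; other spiders in the project keep
    # HttpErrorMiddleware (and its default status filtering) enabled
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None,
        }
    }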