
I'm using Scrapy to crawl my sitemap and check for 404, 302 and 200 pages, but I can't seem to get the response code. This is my code so far:

from scrapy.contrib.spiders import SitemapSpider


class TothegoSitemapHomesSpider(SitemapSpider):
    name = 'tothego_homes_spider'

    ## stuff we need for tothego ##
    sitemap_urls = []
    ok_log_file =       '/opt/Workspace/myapp/crawler/valid_output/ok_homes'
    bad_log_file =      '/opt/Workspace/myapp/crawler/bad_homes'
    fourohfour =        '/opt/Workspace/myapp/crawler/404/404_homes'

    def __init__(self, **kwargs):
        SitemapSpider.__init__(self)

        if len(kwargs) > 1:
            if 'domain' in kwargs:
                self.sitemap_urls = ['http://url_to_sitemap%s/sitemap.xml' % kwargs['domain']]

            if 'country' in kwargs:
                self.ok_log_file += "_%s.txt" % kwargs['country']
                self.bad_log_file += "_%s.txt" % kwargs['country']
                self.fourohfour += "_%s.txt" % kwargs['country']

        else:
            print "USAGE: scrapy [crawler_name] -a country=[country] -a domain=[domain] \nWith [crawler_name]:\n- tothego_homes_spider\n- tothego_cars_spider\n- tothego_jobs_spider\n"
            exit(1)

    def parse(self, response):
        try:
            if response.status == 404:
                ## 404s are also tracked separately
                self.append(self.bad_log_file, response.url)
                self.append(self.fourohfour, response.url)

            elif response.status == 200:
                ## write to ok_log_file
                self.append(self.ok_log_file, response.url)
            else:
                self.append(self.bad_log_file, response.url)

        except Exception, e:
            self.log('[exception] : %s' % e)
            pass

    def append(self, file, string):
        file = open(file, 'a')
        file.write(string+"\n")
        file.close()

According to Scrapy's docs, response.status is an integer corresponding to the status code of the response. So far, only the 200-status URLs get logged, while the 302s are never written to the output file (though I can see the redirects in crawl.log). So, what do I have to do to "trap" the 302 requests and save those URLs?

Samuele Mattiuzzo

2 Answers


http://readthedocs.org/docs/scrapy/en/latest/topics/spider-middleware.html#module-scrapy.contrib.spidermiddleware.httperror

Assuming default spider middleware is enabled, response codes outside of the 200-300 range are filtered out by HttpErrorMiddleware. You can tell the middleware you want to handle 404s by setting the handle_httpstatus_list attribute on your spider.

class TothegoSitemapHomesSpider(SitemapSpider):
    handle_httpstatus_list = [404]
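
A minimal sketch of how that attribute would look applied to the spider from the question (assuming the same pre-1.0 scrapy.contrib API), so that 404 responses reach parse() instead of being dropped:

from scrapy.contrib.spiders import SitemapSpider


class TothegoSitemapHomesSpider(SitemapSpider):
    name = 'tothego_homes_spider'
    # responses with these status codes are passed to the callback
    # instead of being filtered out by HttpErrorMiddleware
    handle_httpstatus_list = [404]

    def parse(self, response):
        # response.status is 404 for missing pages, 200 otherwise
        self.log('%s returned %d' % (response.url, response.status))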
njbooher
  • Maybe my question is a bit fuzzy. My main aim is to write the 200 responses to one file and the 302 responses (the URLs that raise the 302) to another. You can ignore the first if block. What I need is to write the 200s to ok_log_file and the 302s to bad_log_file, and I thought I could do that just by checking the response.status integer code (since, as your link says, they are in the 200-300 range). – Samuele Mattiuzzo Mar 14 '12 at 09:22
  • When it says 200-300 range it means 200-299, I expect. Try setting handle_httpstatus_list = [302] and responses for which response.status == 302 should start reaching your parse method. – njbooher Mar 14 '12 at 09:27
  • Then I interpreted that "range" term very badly. It was literal, but I thought it covered all the 2xx and 3xx responses. I'm trying the list and I'll let you know! Thanks for now! – Samuele Mattiuzzo Mar 14 '12 at 09:32
  • I did exactly as you said and also added HttpErrorMiddleware to the SPIDER_MIDDLEWARES dict, but this doesn't seem to affect my script. – Samuele Mattiuzzo Mar 14 '12 at 10:22
  • Turns out you were 50% right, and I found out how to make it 100%! Cheers, thanks for pointing me in the right direction! – Samuele Mattiuzzo Mar 14 '12 at 10:43
  • What was needed to make it 100%? (for the sake of completeness) – Wizzard Apr 04 '13 at 10:35
  • Great. It helped a lot. Thanks. – Boky Oct 24 '18 at 06:02

Just to have a complete answer here:

  • Set handle_httpstatus_list = [302];

  • On the request, set dont_redirect to True in its meta.

For example: Request(URL, meta={'dont_redirect': True}); a fuller sketch follows below.
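
A minimal, self-contained sketch of how the two pieces fit together, assuming the pre-1.0 Scrapy API used in the question; the spider name and URL are placeholders. handle_httpstatus_list lets the 302 (and 404) responses past HttpErrorMiddleware, while dont_redirect keeps RedirectMiddleware from following the redirect, so parse() sees the original status code:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class StatusCheckSpider(BaseSpider):
    name = 'status_check'
    # let 302 and 404 responses reach parse() instead of being filtered out
    handle_httpstatus_list = [302, 404]
    start_urls = ['http://example.com/some-page']

    def make_requests_from_url(self, url):
        # dont_redirect stops RedirectMiddleware from following the 302,
        # so response.status stays 302 rather than the redirect target's code
        return Request(url, meta={'dont_redirect': True}, callback=self.parse)

    def parse(self, response):
        self.log('%s returned %d' % (response.url, response.status))

With that in place, the parse() logic from the question can sort URLs into ok_log_file and bad_log_file by checking response.status directly.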

Ricardo Lucca