
I'm having trouble with Python Scrapy.

I have a spider that attempts to log in to a site before crawling it. However, the site returns HTTP 401 on the login page, which stops the spider from continuing, even though the body of that response contains the login form to submit.

These are the relevant parts of my crawler:

from scrapy import log
from scrapy.spider import Spider

class LoginSpider(Spider):
    name = "login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Initial user/pass submit
        self.log("Logging in...", level=log.INFO)

The above yields:

2014-02-23 11:52:09+0000 [login] DEBUG: Crawled (401) <GET https://example.com/login> (referer: None)
2014-02-23 11:52:09+0000 [login] INFO: Closing spider (finished)

However if I give it another URL to start on (not the login page) which returns a 200:

2014-02-23 11:50:19+0000 [login] DEBUG: Crawled (200) <GET https://example.com/other-page> (referer: None)
2014-02-23 11:50:19+0000 [login] INFO: Logging in...

You can see that it goes on to execute my `parse()` method and write the log entry.

How do I make Scrapy continue to work with the page despite a 401 response code?

deed02392
    possible duplicate of [Scrapy and response status code: how to check against it?](http://stackoverflow.com/questions/9698372/scrapy-and-response-status-code-how-to-check-against-it) – Wrikken Feb 23 '14 at 12:38
  • I still think it is. You give scrapy one page, it is not valid in the current setting with that response code, there are no more pages to crawl, hence: finished. How do you want both scrapy to continue on a 401, but not to add a 401 to valid pages? Those 2 requirements are mutually exclusive. – Wrikken Feb 23 '14 at 12:54
  • I see, I think the other question does in fact answer this but it's slightly different, this issue is that Scrapy doesn't indicate it's going to ignore the 401 but specifying it as an allowed value fixes the problem. – deed02392 Feb 23 '14 at 12:56
  • I don't understand the issue. If you add `handle_httpstatus_list = [401]` to your spider class, is the login response handled as expected? – Talvalin Feb 23 '14 at 23:16

1 Answer


On the off-chance this question isn't closed as a duplicate: explicitly adding 401 to `handle_httpstatus_list` fixed the issue.

class LoginSpider(Spider):
    handle_httpstatus_list = [401]  # let the 401 login page reach parse()
    name = "login"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Initial user/pass submit
        self.log("Logging in...", level=log.INFO)
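
Why this works: by default Scrapy's `HttpError` middleware drops responses outside the 2xx range before they ever reach the spider, and `handle_httpstatus_list` opts specific status codes back in. A minimal sketch of that decision logic (simplified for illustration; not Scrapy's actual implementation):

```python
def should_process(status, handle_httpstatus_list=(), handle_httpstatus_all=False):
    """Simplified sketch of Scrapy's HttpError filtering:
    2xx responses always reach the spider; other statuses only if allowed."""
    if 200 <= status < 300:       # success range always passes through
        return True
    if handle_httpstatus_all:     # spider opted in to every status code
        return True
    return status in handle_httpstatus_list

# With default settings the 401 login page is silently dropped...
assert not should_process(401)
# ...but listing 401 lets parse() run on it:
assert should_process(401, handle_httpstatus_list=[401])
```

Note that allowed error responses are passed to `parse()` as-is, so the spider must be prepared to handle non-2xx pages itself.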
Talvalin