
I'm trying to log in to a site with Scrapy using InitSpider, but the login always fails. My code is similar to the answer in the post below:

Crawling LinkedIn while authenticated with Scrapy

The response I see in logs is this:

2012-12-20 22:56:53-0500 [linked] DEBUG: Redirecting (302) to <GET https://example.com/> from <POST https://example.com/>

Using the code from the above post, I have the same init_request, login, and check_login_response functions. I can see with log statements that it reaches the login function, but it seems to never reach the check_login_response function.

When I re-implement the code using BaseSpider and issue the FormRequest in the parse function, I'm able to log in with no issue. Is there a reason for this? Is there something else I should be doing? Why am I getting a redirect when logging in with InitSpider?

[EDIT]

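from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class DemoSpider(InitSpider):
    name = 'linked'
    login_page = ''  # Login URL (omitted)
    start_urls = []  # All other URLs (omitted)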

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response, 
            formdata={'username': 'username', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log('got to the parse function')

Above is my spider code.

KVISH

1 Answer


After struggling with this for a bit, I figured it out, and I posted the solution on my blog:

http://tmblr.co/ZjkSZteCOTyH

Basically, I use BaseSpider and override the start_requests method to handle the login.

KVISH