
I'm trying to log into a CMS membership site using code from the Scrapy documentation and related posts, but I keep coming up short. My error messages:

2017-03-20 18:18:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/robots.txt> (referer: None)
2017-03-20 18:18:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/wp-login.php> (referer: None)
2017-03-20 18:18:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <POST http://members.com/login.php> from <POST http://members.com/login.php?wpe-login=membersipa>

I tried changing the user agent to:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0'

But my errors were:

2017-03-20 17:47:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/robots.txt> (referer: None)
2017-03-20 17:47:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/wp-login.php> (referer: None)
2017-03-20 17:47:23 [scrapy.core.engine] DEBUG: Crawled (403) <POST http://members.com/wp-login.php?wpe-login=membersipa> (referer: http://members.com/wp-login.php)
2017-03-20 17:47:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://members.com/wp-login.php?wpe-login=membersipa>: HTTP status code is not handled or not allowed

This is the code that produced the errors:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'freddy'
    start_urls = ['http://members.com/wp-login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'log': 'name', 'pwd': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        # (response.body is bytes, so compare against bytes)
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        else:
            return scrapy.Request(url="http://members.com",
                                  callback=self.parse_ipro)

    def parse_ipro(self, response):
        title = response.xpath('/html/body/div[2]/div/div[1]/div/div/div[2]/div/div/main/article/header/h1').extract_first()
        filename = 'ipro.html'  # local file to dump the page into
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
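One subtlety in the `after_login` check above: in Python 3, Scrapy's `response.body` is `bytes`, so testing a `str` like `"authentication failed"` against it raises a `TypeError`. A minimal sketch of the safe pattern, with a hypothetical response body standing in for the real page (plain Python, no Scrapy needed):

```python
# Hypothetical login-failure body; in Scrapy, response.body is bytes.
body = b'<div id="login_error">ERROR: authentication failed.</div>'

# Compare bytes against bytes to avoid a TypeError in Python 3.
failed = b"authentication failed" in body
print(failed)  # True

# Equivalent check after decoding to text:
failed_text = "authentication failed" in body.decode('utf-8')
print(failed_text)  # True
```

Either form works; decoding via `response.text` is often the more convenient choice when the check involves non-ASCII text.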

Ultimately, I would like to use scrapy shell to test selectors. I tried with scrapy shell but also got knocked on my butt:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'freddy'
    start_urls = ['http://members.com/wp-login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'log': 'name', 'pwd': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return

And tested this in shell:

response.xpath('//title/text()').extract_first()

But I received 301 and 302 redirections.

After adding:

def parse(self, response):
    return scrapy.FormRequest.from_response(
        response,
        headers={'Content-Type': 'text/html; charset=UTF-8', 'User-Agent':
                 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'},
        formdata={'log': 'Name', 'pwd': 'Password', },
        callback=self.after_login
    )

the message changed to:

2017-03-22 03:46:07 [scrapy.core.engine] INFO: Spider opened
2017-03-22 03:46:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-22 03:46:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 
2017-03-22 03:46:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://members.com/login.php> (referer: None)
2017-03-22 03:46:08 [scrapy.core.scraper] ERROR: Spider error processing <GET http://members.com/login.php> (referer: None)
Traceback (most recent call last):

Help is appreciated.

iabraham

1 Answer

You are most likely missing some headers in your FormRequest.

Open up the Network tab in your browser's developer tools, find the request you are looking for, and look under the "Request Headers" section (see the related question: Can scrapy be used to scrape dynamic content from websites that are using AJAX?). Some of the headers are not necessary and some are already included by FormRequest; however, some are not, so you need to replicate those.

Usually it's the Content-Type header that needs to be replicated:

headers = {
    'Content-Type': 'json/...',
}
req = FormRequest(url, formdata=form, headers=headers)
Granitosaurus
  • Thanks for your help @Granitosaurus. I added content for the steps taken based on your suggestions. Also, the POST headers (captured with Firebug) are as follows: log: name, pwd: password, redirect_to: http://members.com/admin/, testcookie: 1, wp-submit: Log In – iabraham Mar 22 '17 at 08:11
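Judging by the fields in that Firebug capture, the login POST likely needs all of them, not only log and pwd. A hedged sketch of the form body a browser would send (field names taken from the comment above; values are placeholders), using only the standard library to show the url-encoded body that `Content-Type: application/x-www-form-urlencoded` describes:

```python
from urllib.parse import urlencode

# Field names from the Firebug capture above; values are placeholders.
formdata = {
    'log': 'name',
    'pwd': 'password',
    'redirect_to': 'http://members.com/admin/',
    'testcookie': '1',
    'wp-submit': 'Log In',
}

# The body a browser sends with Content-Type: application/x-www-form-urlencoded.
body = urlencode(formdata)
print(body)
```

In Scrapy, passing the same dict as `formdata` to `FormRequest` (or `FormRequest.from_response`, which also merges in any hidden fields found in the page's form) produces this body and sets the Content-Type automatically.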