
Python 3.6 - Scrapy 1.5

I'm scraping the John Deere warranty webpage to track all new PMPs and their expiration dates. Inspecting the network traffic between the browser and the page, I found a REST API that feeds data to the page.

Now I'm trying to get the JSON data from the API rather than scraping the JavaScript-rendered page content. However, I'm getting an Internal Server Error and I don't know why.

I'm using Scrapy to log in and fetch the data.

import scrapy

class PmpSpider(scrapy.Spider):
    name = 'pmp'
    start_urls = ['https://jdwarrantysystem.deere.com/portal/']

    def parse(self, response):

        self.log('***Form Request***')
        login ={
            'USERNAME':*******,
            'PASSWORD':*******
            }
        yield scrapy.FormRequest.from_response(
            response,
            url = 'https://registration.deere.com/servlet/com.deere.u90950.registrationlogin.view.servlets.SignInServlet',
            method = 'POST', formdata = login, callback = self.parse_pmp
        )
        self.log('***PARSE LOGIN***')

    def parse_pmp(self, response):
        self.log('***PARSE PMP***')
        cookies = response.headers.getlist('Set-Cookie')
        for cookie in cookies:
            cookie = cookie.decode('utf-8')
            self.log(cookie)
            cook = cookie.split(';')[0].split('=')[1]
            path = cookie.split(';')[1].split('=')[1]
            domain = cookie.split(';')[2].split('=')[1]
        yield scrapy.Request(
            url = 'https://jdwarrantysystem.deere.com/api/pip-products/collection',
            method = 'POST',
            cookies = {
                'SESSION':cook,
                'path':path,
                'domain':domain
            },
            headers = {
            "Accept":"application/json",
            "accounts":["201445","201264","201167","201342","201341","201221"],
            "excludedPin":"",
            "export":"",
            "language":"",
            "metric":"Y",
            "pipFilter":"OPEN",
            "pipType":["MALF","SAFT"]
            },
            meta = {'dont_redirect': True},
            callback = self.parse_pmp_list
        )

    def parse_pmp_list(self, response):
        self.log('***LISTA PMP***')
        self.log(response.body)

Why am I getting this error? How can I get data from this API?

2018-07-05 17:26:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 1 times): 500 Internal Server Error
2018-07-05 17:26:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 2 times): 500 Internal Server Error
2018-07-05 17:26:21 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 3 times): 500 Internal Server Error
2018-07-05 17:26:21 [scrapy.core.engine] DEBUG: Crawled (500) <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (referer: https://jdwarrantysystem.deere.com/portal/)
2018-07-05 17:26:21 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://jdwarrantysystem.deere.com/api/pip-products/collection>: HTTP status code is not handled or not allowed


1 Answer


I found the problem: this is a POST request, so its parameters must be sent as JSON in the request body. Unlike a GET request, they don't go in the URI, and they don't belong in the headers either, which is where my original code put them. The request also needs a "content-type": "application/json" header. See: How parameters are sent in POST request and Rest POST in python. So, editing the function parse_pmp:

import json  # needed for json.dumps; goes at the top of the spider module

    def parse_pmp(self, response):
        self.log('***PARSE PMP***')
        cookies = response.headers.getlist('Set-Cookie')
        for cookie in cookies:
            cookie = cookie.decode('utf-8')
            self.log(cookie)
            cook = cookie.split(';')[0].split('=')[1]
            path = cookie.split(';')[1].split('=')[1]
            domain = cookie.split(';')[2].split('=')[1]

        # The API parameters are serialized to JSON and sent as the request body.
        data = json.dumps({  # <----
            "accounts": ["201445", "201264", "201167", "201342", "201341", "201221"],
            "excludedPin": "",
            "export": "",
            "language": "",
            "metric": "Y",
            "pipFilter": "OPEN",
            "pipType": ["MALF", "SAFT"]
        })
        yield scrapy.Request(
            url = 'https://jdwarrantysystem.deere.com/api/pip-products/collection',
            method = 'POST',
            cookies = {
                'SESSION':cook,
                'path':path,
                'domain':domain
            },
            headers = {
                "Accept":"application/json",
                "content-type": "application/json" # <----
            },
            body = data, # <----
            meta = {'dont_redirect': True},
            callback = self.parse_pmp_list
        )

Everything works fine!
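
For anyone who wants to check the two pieces of this fix in isolation, here is a minimal sketch that uses only the standard library, so it runs without Scrapy. The helper names build_json_post and cookies_from_headers are hypothetical (they are not part of Scrapy or of the code above), and the sample cookie value is made up. The second helper is an optional alternative to the manual split() calls: it assumes the server sends ordinary Set-Cookie headers and keeps only the name/value pairs, since Path and Domain are attributes of a cookie rather than cookies themselves.

```python
import json
from http.cookies import SimpleCookie


def build_json_post(payload):
    """Return (headers, body) for a POST whose parameters travel as JSON.

    Hypothetical helper: the parameters go in the body, not in the headers.
    """
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    body = json.dumps(payload)
    return headers, body


def cookies_from_headers(set_cookie_headers):
    """Collect cookie name/value pairs from raw Set-Cookie header values.

    Hypothetical helper: attributes such as Path and Domain are parsed
    but deliberately left out of the returned dict.
    """
    jar = {}
    for raw in set_cookie_headers:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            jar[name] = morsel.value
    return jar


# Same parameter names as in the answer above; the cookie value is invented.
payload = {"metric": "Y", "pipFilter": "OPEN", "pipType": ["MALF", "SAFT"]}
headers, body = build_json_post(payload)
cookies = cookies_from_headers([b"SESSION=abc123; Path=/; Domain=deere.com"])
```

In the spider, these values would be passed to scrapy.Request through its headers, body and cookies arguments, exactly where the hand-built dictionaries are passed above.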