
I am trying to use a Scrapy spider to crawl a website, using a FormRequest to send a keyword to the search box on a city-specific page. It seems straightforward from what I've read, but I'm having trouble. I'm fairly new to Python, so sorry if there's something obvious I'm overlooking.

Here are the main three sites I was trying to use to help me: Mouse vs Python [1]; Stack Overflow [2]; Scrapy.org [3]

This is the specific URL I am crawling: http://www.lkqpickyourpart.com/locations/LKQ_Self_Service_-_Gainesville-224/recents

In the page source I found the search box: <input name="dnn$ctl01$txtSearch" type="text" maxlength="255" size="20" id="dnn_ctl01_txtSearch" class="NormalTextBox" autocomplete="off" placeholder="Search..." /> So I believe the name of the search field is "dnn$ctl01$txtSearch", which I would use in the example cited as [2], and I want to submit "toyota" as my keyword for the vehicle search.

Here is the code I have for my spider right now (I am aware I am importing more than I need at the beginning):

import scrapy
from scrapy.http import FormRequest
from scrapy.item import Item, Field


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = ['http://www.lkqpickyourpart.com/locations/LKQ_Self_Service_-_Gainesville-224/recents/']

    def start_requests(self):
        return [FormRequest("http://www.lkqpickyourpart.com/locations/LKQ_Self_Service_-_Gainesville-224/recents",
                            formdata={'dnn$ctl01$txtSearch': 'toyota'},
                            callback=self.parse)]

    def parse(self, response):
        print(response.status)

Why is it not searching or printing any kind of results? Is the example I'm copying from only intended for logging in to websites, not for submitting to search bars?
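For reference, this is the FormRequest.from_response pattern from the example cited as [2] that I was trying to adapt (just a sketch, with the field name taken from the page source above):

import scrapy
from scrapy.http import FormRequest


class SearchSpider(scrapy.Spider):
    name = "lkq_search"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = ['http://www.lkqpickyourpart.com/locations/LKQ_Self_Service_-_Gainesville-224/recents/']

    def parse(self, response):
        # from_response() copies the page's form fields (including the
        # hidden ASP.NET __VIEWSTATE ones) and posts back to the form,
        # overriding only the fields given in formdata.
        yield FormRequest.from_response(
            response,
            formdata={'dnn$ctl01$txtSearch': 'toyota'},
            callback=self.after_search)

    def after_search(self, response):
        print(response.status)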

Thanks, Dan the newbie Python writer

Daniel Royer

2 Answers


Here you go :)

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser


class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US',
    )

    def parse(self, response):
        # Each car's "Section: ..." / "Color: ..." notes live in their own div.
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        # The string predicate here is always true, so this grabs the text
        # of every <td>; the cells arrive in groups of four per car:
        # make, model, year, date entered into the yard.
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            # The notes alternate "Section: ..." then "Color: ...".
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            yield item

        # Debugging helpers, left commented out:
        # open_in_browser(response)
        # inspect_response(response, self)

The page that you're trying to scrape is generated by an AJAX call.

Scrapy by default doesn't execute JavaScript, so it won't see any dynamically loaded content, including content fetched over AJAX. Almost all sites that load data dynamically as you scroll down the page do it with AJAX. Trapping AJAX calls is pretty simple using either Chrome Dev Tools or Firebug for Firefox: all you have to do is watch the XHR requests in the Network panel. An XHR (XMLHttpRequest) is an AJAX request.

Here's a screenshot of how it looks:

[Screenshot: capturing an XHR request in the Chrome Dev Tools Network panel]

Once you find the link, you can change its query-string parameters.

This is the link that the XHR request in Chrome Dev Tools gave me:

http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US
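If you want to sanity-check a captured link like that before writing any spider code, you can paste it into scrapy shell and try your selectors interactively (the quotes matter because of the & characters in the URL):

scrapy shell "http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US"
>>> response.xpath('//div[@class="pypvi_notes"]/p/text()').extract()[:2]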

I've changed the pageSize parameter to 1000 up there to get 1000 results per page (the default was 15). There's also a page parameter, which you would ideally keep increasing until you've captured all the data.
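If it helps, here's a minimal sketch of how you might parameterize that link and walk the page parameter until the results run out. The inventory_url helper and the empty-page stop condition are my own assumptions, not something the site documents:

from urllib.parse import urlencode  # on Python 2: from urllib import urlencode

import scrapy

BASE = ('http://www.lkqpickyourpart.com/DesktopModules/'
        'pyp_vehicleInventory/getVehicleInventory.aspx')


def inventory_url(keyword, page=0, page_size=1000):
    # Parameter names come straight from the captured XHR link above;
    # only the values change between requests.
    params = {'store': 224, 'page': page, 'filter': keyword,
              'sp': '', 'cl': '', 'carbuyYardCode': 1224,
              'pageSize': page_size, 'language': 'en-US'}
    return '%s?%s' % (BASE, urlencode(params))


class PagingLkqSpider(scrapy.Spider):
    name = "lkq_paging"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = [inventory_url('toyota')]

    def parse(self, response):
        if not response.xpath('//td/text()').extract():
            return  # empty page: assume we have run out of inventory
        # ... build and yield Cars items exactly as in parse() above ...
        next_page = response.meta.get('page', 0) + 1
        yield scrapy.Request(inventory_url('toyota', page=next_page),
                             meta={'page': next_page},
                             callback=self.parse)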

BoreBoar
  • Thanks! This is really helpful. So in the start URL, after "filter=" and before "&sp", there could be a keyword variable that changes depending on the search page I want to generate and scrape? What would be the best way to store the list of cars and accompanying images, if I wanted to combine it with search results scraped from other start URLs too? – Daniel Royer Mar 27 '16 at 22:45

The web page requires a JavaScript rendering engine to load its content, which Scrapy doesn't provide on its own.

Use Splash and refer to its documentation for usage.
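For completeness, here is a minimal sketch of what that could look like with the scrapy-splash plugin, assuming a Splash instance is running locally on the default port 8050 (the required settings are abbreviated to a comment):

import scrapy
from scrapy_splash import SplashRequest

# settings.py needs SPLASH_URL = 'http://localhost:8050' plus the
# scrapy-splash downloader middlewares and dupefilter from its README.


class SplashLkqSpider(scrapy.Spider):
    name = "lkq_splash"
    allowed_domains = ["lkqpickyourpart.com"]

    def start_requests(self):
        url = ('http://www.lkqpickyourpart.com/locations/'
               'LKQ_Self_Service_-_Gainesville-224/recents/')
        # args={'wait': 2} gives the page's JavaScript two seconds
        # to run before Splash returns the rendered HTML.
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # response.body is now the rendered page, so normal XPath
        # selectors (like the ones in the accepted answer) will work.
        self.logger.info('Rendered page size: %d bytes', len(response.body))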

  • There isn't a need to use Splash or Selenium for simple AJAX calls. Check this link: http://stackoverflow.com/questions/16390257/scraping-ajax-pages-using-python – BoreBoar Mar 24 '16 at 12:25