scrapy crawl spider ajax pagination

Question

I am trying to scrape a page that uses an AJAX call for pagination. The site I want to crawl is http://www.demo.com, and in my .py file I restrict link extraction with an XPath. The code is:

# -*- coding: utf-8 -*-
import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from sum.items import sumItem

class Sumspider1(CrawlSpider):
    name = 'sumDetailsUrls'
    allowed_domains = ['sum.com']
    start_urls = ['http://www.demo.com']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//ul[@id="pager"]/li[8]/a'), callback='parse_start_url', follow=True),
    )

    #use parse_start_url if your spider wants to crawl from first page , so overriding 
    def parse_start_url(self, response):
        print '********************************************1**********************************************'
        #//div[@class="showMoreCars hide"]/a
        #.//ul[@id="pager"]/li[8]/a/@href
        self.log('Inside - parse_item %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = sumItem()
        item['page'] = response.url
        title = hxs.xpath('.//h1[@class="page-heading"]/text()').extract() 
        print '********************************************title**********************************************',title
        urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        print '**********************************************2***url*****************************************',urls

        finalurls = []       

        for url in urls:
            print '---------url-------',url
            finalurls.append(url)          

        item['urls'] = finalurls
        return item

My items.py file contains

from scrapy.item import Item, Field


class sumItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = Field()
    urls = Field()

I still don't get the expected output: the crawl does not fetch all the pages.

If parts of the web page are rendered with content retrieved using JavaScript, you need to render the complete page using a JavaScript engine before parsing it. Maybe you should look at http://www.seleniumhq.org/ ? — Jonatan, Dec 16 '14 at 10:09
Dear Jonatan, if you don't mind, could you please explain briefly with an example? I am a beginner at this. — Charu Awhad, Dec 16 '14 at 10:21
I don't have an example at hand, unfortunately. Based on your description though, it seems like your problem is that the web page is not completely generated when you begin parsing it, therefore you need to process the JavaScript in the web page before parsing it. One way to do this is using a full browser, which is why I suggested using Selenium. — Jonatan, Dec 16 '14 at 10:28
@CharuAwhad: use scrapy shell to test what scrapy sees when it loads the page, prior to executing any JavaScript. Test your XPath expressions from there. More info: http://doc.scrapy.org/en/latest/topics/shell.html — bosnjak, Dec 16 '14 at 11:58
Dear Lawrence, do you have any examples of it? I am a beginner, so I am not following it properly. — Charu Awhad, Dec 16 '14 at 14:01
There are examples in the link I provided. You can find more in this tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html — bosnjak, Dec 16 '14 at 16:52
Dear all, thanks for guiding me. I tried your solutions but still didn't get all the URLs. Then I followed this link: http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page . I tried Selenium but am still not able to crawl all the URLs. I have posted my query here: http://stackoverflow.com/questions/27525142/selenium-ajax-dynamic-pagination-base-spider . Can anybody please help me? — Charu Awhad, Dec 17 '14 at 12:54

Anantha · Accepted Answer · 2014-12-19T11:07:19.357

I hope the below code will help.

somespider.py

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

from demo.items import DemoItem


def removeUnicodes(strData):
    # Encode to UTF-8 and turn newlines/tabs into spaces (Python 2 string handling).
    if strData:
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData)
    return strData


class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self, *args, **kwargs):
        super(demoSpider, self).__init__(*args, **kwargs)
        # Needs a Selenium server listening on port 4444 (see the note below).
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        item['pageurl'] = response.url
        item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])

        # Keep clicking "show more cars" until the link is gone or can no longer be clicked.
        while True:
            try:
                self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a').click()
            except WebDriverException:
                break

        # Collect the detail links once, after everything has loaded, to avoid duplicates.
        finalurls = []
        for link in self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]'):
            finalurls.append(removeUnicodes(link.get_attribute("href")))
        item['urls'] = finalurls

        self.driver.close()
        return item

items.py

from scrapy.item import Item, Field

class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()

Note: You need a Selenium RC server running, because from Python HTMLUNITWITHJS works only through Selenium RC.

Start the Selenium RC server with the command:

java -jar selenium-server-standalone-2.44.0.jar

Run your spider using the command:

scrapy crawl domainurls -o someoutput.json

score 1 · Answer 2 · answered Dec 16 '14 at 17:02

You can check in your browser how the requests are made.

Behind the scenes, right after you click the "show more cars" button, your browser requests JSON data to feed the next page. You can take advantage of this and work directly with the JSON data, without needing a JavaScript engine such as Selenium or PhantomJS.

In your case, as a first step you should simulate a user scrolling down the page given by your start_url parameter while profiling your network requests, to discover the endpoint the browser uses to request that JSON. To find this endpoint, there is generally an XHR (XMLHttpRequest) section in the browser's profiling tool, as in Safari, where you can navigate through all the resources/endpoints used to request the data.

Once you discover this endpoint, the task is straightforward: give your spider the endpoint you just discovered as its start_url, and as you process and navigate through the JSON responses you can determine whether there is a next page to request.
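
The paging loop described above could be sketched roughly as follows (Python 3). This is only an illustration under assumptions: the structure of the JSON is not shown in this thread, so the "listings" and "url" keys, the empty-page stop condition, and the helper names collect_listing_urls and fake_fetch are all hypothetical. The real response has to be inspected in the browser first.

```python
import json


def collect_listing_urls(fetch_page, first_page=1, max_pages=500):
    """Request page after page until one comes back empty.

    fetch_page(pn) must return the raw JSON text for page number pn.
    The "listings" and "url" keys are hypothetical placeholders.
    """
    urls = []
    for pn in range(first_page, first_page + max_pages):
        payload = json.loads(fetch_page(pn))
        listings = payload.get("listings", [])
        if not listings:  # assumed stop signal: an empty page means no next page
            break
        urls.extend(entry["url"] for entry in listings)
    return urls


# Stand-in for the real HTTP call so the sketch runs without network access.
def fake_fetch(pn):
    pages = {1: ["/car/1", "/car/2"], 2: ["/car/3"]}
    return json.dumps({"listings": [{"url": u} for u in pages.get(pn, [])]})


print(collect_listing_urls(fake_fetch))  # ['/car/1', '/car/2', '/car/3']
```

In a real spider, fetch_page would be replaced by a request to the discovered endpoint, and max_pages only guards against looping forever if the stop condition never triggers.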

P.S.: I checked for you; the endpoint URL is http://www.carwale.com/webapi/classified/stockfilters/?city=194&kms=0-&year=0-&budget=0-&pn=2

In this case my browser requested the second page, as you can see from the pn parameter. It is important to set some header parameters before you send the request. I noticed that in your case the headers are:

Accept text/plain, */*; q=0.01

Referer http://www.carwale.com/used/cars-in-trichy/

X-Requested-With XMLHttpRequest

sourceid 1

User-Agent Mozilla/5.0...
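
As a rough sketch (Python 3, standard library only), the endpoint and headers above can be turned into per-page request data like this. The names ENDPOINT, BASE_PARAMS, AJAX_HEADERS and page_url are made up for illustration, and the User-Agent is left out because it is truncated above; a real browser string would have to be supplied.

```python
from urllib.parse import urlencode

# Endpoint and query values copied from the URL quoted above.
ENDPOINT = "http://www.carwale.com/webapi/classified/stockfilters/"
BASE_PARAMS = [("city", "194"), ("kms", "0-"), ("year", "0-"), ("budget", "0-")]

# Headers as listed above. User-Agent is omitted on purpose: it is truncated
# in the answer, so add your own browser's value before sending requests.
AJAX_HEADERS = {
    "Accept": "text/plain, */*; q=0.01",
    "Referer": "http://www.carwale.com/used/cars-in-trichy/",
    "X-Requested-With": "XMLHttpRequest",
    "sourceid": "1",
}


def page_url(page_number):
    """Return the endpoint URL for one page; only the pn parameter changes."""
    query = urlencode(BASE_PARAMS + [("pn", str(page_number))])
    return ENDPOINT + "?" + query


print(page_url(2))
```

Inside a Scrapy spider, each URL could then be requested with something like scrapy.Request(page_url(n), headers=AJAX_HEADERS, callback=...), parsing the JSON body in the callback instead of running XPath on HTML.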

Thanks for helping me. I tried to do it with Selenium and have written this code: — Charu Awhad, Dec 17 '14 at 11:28
Dear all, thanks for guiding me. I tried your solutions but still didn't get all the URLs. Then I followed this link: http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page . I tried Selenium but am still not able to crawl all the URLs. I have posted my query here: http://stackoverflow.com/questions/27525142/selenium-ajax-dynamic-pagination-base-spider . Can anybody please help me? — Charu Awhad, Dec 17 '14 at 12:56
@CharuAwhad, make yourself comfortable, man. I took a quick look at the post mentioned in your last comment. I noticed you simulated a user's behaviour, using the `click()` function to get the next page. I guess that won't work for you, and it isn't what I meant. Instead, you need to explicitly request the endpoint URL that returns the JSON data, attaching the request headers, and then process the result and iterate over the next pages. In other words, you need to simulate what is going on behind the scenes rather than simulate a user clicking. Hope you have success there. — Saulo Ricci, Dec 17 '14 at 14:43
