
I am quite new to Scrapy and have built a few spiders. I am trying to scrape reviews from this page. My spider so far crawls the first page and scrapes those items, but when it comes to pagination it does not follow links.

I know this happens because it is an Ajax request, and a POST rather than a GET. I am a newbie with these things, but I have read this post here and followed the "mini-tutorial" to get the URL from the response, which seems to be

http://www.pcguia.pt/category/reviews/sorter=recent&location=&loop=main+loop&action=sort&view=grid&columns=3&paginated=2&currentquery%5Bcategory_name%5D=reviews

but when I try to open it in a browser it says

"Página não encontrada" ("page not found")

So far, am I thinking along the right lines? What am I missing?
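To sanity-check the POST-vs-GET point, here is a quick Python 3 sketch (stdlib only, parameter names copied from the URL above): a request only becomes a POST once the parameters are attached as a body, which is why pasting the URL into the browser's address bar issues a plain GET and hits the 404 page.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Parameters from the query string above, moved into a POST body.
params = {
    'sorter': 'recent',
    'location': '',
    'loop': 'main loop',
    'action': 'sort',
    'view': 'grid',
    'columns': '3',
    'paginated': '2',
    'currentquery[category_name]': 'reviews',
}

url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

# No data attached: this is what the browser address bar does.
assert Request(url).get_method() == 'GET'

# With the form data attached as the body, urllib switches to POST,
# which is what the page's Ajax call actually sends.
req = Request(url, data=urlencode(params).encode())
print(req.get_method())  # POST
```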

EDIT: my spider:

import scrapy
import json
from scrapy.http import FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem

class PcguiaSpider(scrapy.Spider):
    name = "pcguia" #spider name to call in terminal
    allowed_domains = ['pcguia.pt'] #the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1'] #url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):

        sel = Selector(response)

        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))


        hxs = Selector(response)

        item_pub = ReviewItem()

        item_pub['date'] = hxs.xpath('//span[@class="date"]/text()').extract() # desired format: year-month-dayThours:minutes:seconds-timezone, e.g. 2015-03-31T09:40:00-0700


        item_pub['title'] = hxs.xpath('//title/text()').extract()

        #pagination code starts here 
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr +=1
            formdata = {
                        'sorter':'recent',
                        'location':'main loop',
                        'loop':'main loop',
                        'action':'sort',
                        'view':'grid',
                        'columns':'3',
                        'paginated':str(self.page_incr),
                        'currentquery[category_name]':'reviews'
                        }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

        yield item_pub

output:

2015-05-12 14:53:45+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: pcguia)
2015-05-12 14:53:45+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:53:45+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pcguia.spiders', 'SPIDER_MODULES': ['pcguia.spiders'], 'BOT_NAME': 'pcguia'}
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled item pipelines: 
2015-05-12 14:53:45+0100 [pcguia] INFO: Spider opened
2015-05-12 14:53:45+0100 [pcguia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6033
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6090
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Crawled (200) <GET http://www.pcguia.pt/category/reviews/#paginated=1> (referer: None)
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/category/reviews/>
    {'date': '',
     'title': [u'Reviews | PCGuia']}
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: http://www.pcguia.pt/category/reviews/)
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
    {'date': '',
     'title': ''}
Inês Martins
  • From where are you taking the date? I couldn't find a satisfying XPath for it. By "title" you mean the review title, right? It looks like you have taken the page title instead. Post the output you want to fetch. – Jithin May 13 '15 at 04:05
  • I am taking the date from, for example, here: http://www.pcguia.pt/desktops/asus-rog-gr8/ the XPath '//span[@class="date"]/text()' points to 'Publicado a 10 Dezembro, 2014'. – Inês Martins May 13 '15 at 08:13

2 Answers


You can try this:

import json
from scrapy.http import FormRequest
from scrapy.selector import Selector
# other imports

class SpiderClass(Spider):
    # spider name and all
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):

        sel = Selector(response)

        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # your code here 

        #pagination code starts here 
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr +=1
            formdata = {
                    'sorter':'recent',
                    'location':'main loop',
                    'loop':'main loop',
                    'action':'sort',
                    'view':'grid',
                    'columns':'3',
                    'paginated':str(self.page_incr),
                    'currentquery[category_name]':'reviews'
                    }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

I have tested this in the Scrapy shell and it works:

In [0]: response.url
Out[0]: 'http://www.pcguia.pt/category/reviews/#paginated=1'

In [1]: from scrapy.http import FormRequest

In [2]: from scrapy.selector import Selector

In [3]: import json

In [4]: response.xpath('//h2/a/text()').extract()
Out[4]: 
        [u'HP Slate 8 Plus',
         u'Astro A40 +MixAmp Pro',
         u'Asus ROG G751J',
         u'BQ Aquaris E5 HD 4G',
         u'Asus GeForce GTX980 Strix',
         u'AlienTech BattleBox Edition',
         u'Toshiba Encore Mini WT7-C',
         u'Samsung Galaxy Note 4',
         u'Asus N551JK',
         u'Western Digital My Passport Wireless',
         u'Nokia Lumia 735',
         u'Photoshop Elements 13',
         u'AMD Radeon R9 285',
         u'Asus GeForce GTX970 Stryx',
         u'TP-Link AC750 Wifi Repeater']

In [5]: url = "http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php"

In [6]: formdata = {
        'sorter':'recent',
        'location':'main loop',
        'loop':'main loop',
        'action':'sort',
        'view':'grid',
        'columns':'3',
        'paginated':'2',
        'currentquery[category_name]':'reviews'
        }

In [7]: r = FormRequest(url=url, formdata=formdata)

In [8]: fetch(r)
        2015-05-12 18:29:16+0530 [default] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: None)
        [s] Available Scrapy objects:
        [s]   crawler    <scrapy.crawler.Crawler object at 0x7fcc247c4590>
        [s]   item       {}
        [s]   r          <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
        [s]   request    <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
        [s]   response   <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
        [s]   settings   <scrapy.settings.Settings object at 0x7fcc2a74f450>
        [s]   spider     <Spider 'default' at 0x7fcc239ba990>
        [s] Useful shortcuts:
        [s]   shelp()           Shell help (print this help)
        [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
        [s]   view(response)    View response in a browser

In [9]: json_data = json.loads(response.body)

In [10]: sell = Selector(text=json_data.get('content', ''))

In [11]: sell.xpath('//h2/a/text()').extract()
Out[11]: 
        [u'Asus ROG GR8',
         u'Devolo dLAN 1200+',
         u'Yezz Billy 4,7',
         u'Sony Alpha QX1',
         u'Toshiba Encore2 WT10',
         u'BQ Aquaris E5 FullHD',
         u'Toshiba Canvio AeroMobile',
         u'Samsung Galaxy Tab S 10.5',
         u'Modecom FreeTab 7001 HD',
         u'Steganos Online Shield VPN',
         u'AOC G2460PG G-Sync',
         u'AMD Radeon R7 SSD',
         u'Nvidia Shield',
         u'Asus ROG PG278Q GSync',
         u'NOX Krom Kombat']
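What makes this work is that the Ajax endpoint returns JSON whose 'content' field is an HTML fragment, so the body has to be unwrapped with json.loads before it can be parsed again. The same unwrapping can be sketched with the stdlib alone; the payload below is a made-up stand-in for the real response body:

```python
import json
from html.parser import HTMLParser

# A stand-in for response.body: JSON wrapping an HTML fragment,
# shaped like the pcguia.pt Ajax response.
body = json.dumps({'content': '<h2><a href="/x">Asus ROG GR8</a></h2>'
                              '<h2><a href="/y">Nvidia Shield</a></h2>'})

class TitleCollector(HTMLParser):
    """Collect the text of <a> tags nested in <h2>, mimicking //h2/a/text()."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # how many of <h2>/<a> we are currently inside
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if tag in ('h2', 'a'):
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in ('h2', 'a'):
            self.depth -= 1
    def handle_data(self, data):
        if self.depth == 2:  # inside both <h2> and <a>
            self.titles.append(data)

# Unwrap the JSON, then re-parse the HTML fragment it carries.
fragment = json.loads(body).get('content', '')
collector = TitleCollector()
collector.feed(fragment)
print(collector.titles)  # ['Asus ROG GR8', 'Nvidia Shield']
```

In the spider, `Selector(text=json_data.get('content', ''))` does this second parsing step for you.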

EDIT

import scrapy
import json
from scrapy.http import FormRequest, Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem
from dateutil import parser
import re


class PcguiaSpider(scrapy.Spider):
    name = "pcguia" #spider name to call in terminal
    allowed_domains = ['pcguia.pt'] #the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1'] #url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)
        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))
        review_links = sel.xpath('//h2/a/@href').extract()
        for link in review_links:
            yield Request(url=link, callback=self.parse_review)
        #pagination code starts here 
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr +=1
            formdata = {
                        'sorter':'recent',
                        'location':'main loop',
                        'loop':'main loop',
                        'action':'sort',
                        'view':'grid',
                        'columns':'3',
                        'paginated':str(self.page_incr),
                        'currentquery[category_name]':'reviews'
                        }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

    def parse_review(self, response):
        month_matcher = 'novembro|janeiro|agosto|mar\xe7o|fevereiro|junho|dezembro|julho|abril|maio|outubro|setembro'
        month_dict = {u'abril': u'April',
                                u'agosto': u'August',
                                u'dezembro': u'December',
                                u'fevereiro': u'February',
                                u'janeiro': u'January',
                                u'julho': u'July',
                                u'junho': u'June',
                                u'maio': u'May',
                                u'mar\xe7o': u'March',
                                u'novembro': u'November',
                                u'outubro': u'October',
                                u'setembro': u'September'}
        review_date = response.xpath('//span[@class="date"]/text()').extract()
        review_date = review_date[0].replace('Publicado a', '').strip().lower() if review_date else ''
        month = re.findall('%s'% month_matcher, review_date)[0]
        _date = parser.parse(review_date.replace(month, month_dict.get(month))).strftime('%Y-%m-%dT%H:%M:%S')
        title = response.xpath('//h1[@itemprop="itemReviewed"]/text()').extract()
        title = title[0].strip() if title else ''
        item_pub = ReviewItem(
            date=_date,
            title=title)
        yield item_pub

output:

{'date': '2014-11-05T00:00:00', 'title': u'Samsung Galaxy Tab S 10.5'}
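The month translation inside parse_review can be exercised on its own. Here is a stdlib-only sketch of the same idea (Python 3.9+ for removeprefix, datetime.strptime instead of dateutil; parse_pt_date is just an illustrative name):

```python
from datetime import datetime

# Portuguese month names mapped to English, as in parse_review above.
MONTHS = {'janeiro': 'January', 'fevereiro': 'February', 'março': 'March',
          'abril': 'April', 'maio': 'May', 'junho': 'June',
          'julho': 'July', 'agosto': 'August', 'setembro': 'September',
          'outubro': 'October', 'novembro': 'November', 'dezembro': 'December'}

def parse_pt_date(text):
    """Turn e.g. 'Publicado a 10 Dezembro, 2014' into an ISO timestamp."""
    # Drop the 'Publicado a' prefix, then lowercase for the month lookup.
    cleaned = text.strip().removeprefix('Publicado a').strip().lower()
    for pt, en in MONTHS.items():
        if pt in cleaned:
            cleaned = cleaned.replace(pt, en)
            break
    # %B assumes the default C/English locale for month names.
    return datetime.strptime(cleaned, '%d %B, %Y').strftime('%Y-%m-%dT%H:%M:%S')

print(parse_pt_date('Publicado a 10 Dezembro, 2014'))  # 2014-12-10T00:00:00
```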
Jithin

The proper solution here would be to use Selenium. The problem you are facing is that the updated page source never reaches your Scrapy spider.

Selenium will let you click through the subsequent links and pass the updated page source on to your response.xpath calls.

I can provide more help if you share the Scrapy code you are using.

John Dene