
I want to retrieve the prices of mobiles from http://www.bigcmobiles.in/categories/Mobile-Phones-Smart-Phones/cid-CU00091056.aspx. I used hxs.select('.//div[1]/div/div[1]/div/span/label[2]').extract(), which returns an empty list.

Can you please explain the reason for this?

alecxe

1 Answer


The problem is that the products (mobiles) on this site are loaded dynamically via an XHR request. You have to simulate that request in Scrapy in order to get the necessary data. For more info on the subject, see:

Here's a spider for your case. Note that I got the URL from the Network tab of the Chrome developer tools:

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BigCMobilesItem(Item):
    title = Field()
    price = Field()


class BigCMobilesSpider(BaseSpider):
    name = "bigcmobile_spider"
    allowed_domains = ["bigcmobiles.in"]
    start_urls = [
        "http://www.bigcmobiles.in/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput={%22PgControlId%22:1152173,%22IsConfigured%22:true,%22ConfigurationType%22:%22%22,%22CombiIds%22:%22%22,%22PageNo%22:1,%22DivClientId%22:%22ctl00_ContentPlaceHolder1_ctl00_ctl07_Showcase%22,%22SortingValues%22:%22%22,%22ShowViewType%22:%22%22,%22PropertyBag%22:null,%22IsRefineExsists%22:true,%22CID%22:%22CU00091056%22,%22CT%22:0,%22TabId%22:0}&_=1369724967084"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each product is rendered inside a <div class="bucket">.
        mobiles = hxs.select("//div[@class='bucket']")
        for mobile in mobiles:
            item = BigCMobilesItem()
            item['title'] = mobile.select('.//h4[@class="mtb-title"]/text()').extract()[0]
            try:
                # The second text node of the offer label holds the price.
                item['price'] = mobile.select(
                    './/span[@class="mtb-price"]/label[@class="mtb-ofr"]/text()').extract()[1].strip()
            except IndexError:
                # No second text node, so no price is available for this product.
                item['price'] = 'n/a'
            yield item

Save it as spider.py and run it via scrapy runspider spider.py -o output.json. Then in output.json you will see:

{"price": "13,999", "title": "Samsung Galaxy S Advance i9070"}
{"price": "9,999", "title": "Micromax A110 Canvas 2"}
{"price": "25,990", "title": "LG Nexus 4 E960"}
{"price": "39,500", "title": "Samsung Galaxy S4 I9500 - Black"}
...

These are the products from the first page. To get mobiles from the other pages, take a look at the XHR request the site is using: it has a PageNo parameter, which looks like what you need.
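As a rough sketch of how the PageNo parameter could be varied, the snippet below rebuilds the handler URL from the JSON payload seen in the spider's start URL. It uses only the Python 3 standard library, so it is separate from the Scrapy 0.17 spider. The names HANDLER_URL, BASE_PARAMS and build_page_url are hypothetical, the page count of three is an assumption, and the trailing `_` cache-busting parameter is left out on the assumption that the server does not require it.

```python
import json
from urllib.parse import quote

# Handler endpoint taken from the spider's start URL.
HANDLER_URL = "http://www.bigcmobiles.in/Handler/ProductShowcaseHandler.ashx"

# Parameters copied from the captured request; only PageNo changes per page.
BASE_PARAMS = {
    "PgControlId": 1152173,
    "IsConfigured": True,
    "ConfigurationType": "",
    "CombiIds": "",
    "PageNo": 1,
    "DivClientId": "ctl00_ContentPlaceHolder1_ctl00_ctl07_Showcase",
    "SortingValues": "",
    "ShowViewType": "",
    "PropertyBag": None,
    "IsRefineExsists": True,
    "CID": "CU00091056",
    "CT": 0,
    "TabId": 0,
}


def build_page_url(page_no):
    """Return the handler URL for the given page (hypothetical helper)."""
    params = dict(BASE_PARAMS, PageNo=page_no)
    # Compact separators mirror the captured URL, which contains no spaces.
    payload = json.dumps(params, separators=(",", ":"))
    # Leave JSON punctuation as-is; double quotes are percent-encoded as %22.
    return HANDLER_URL + "?ProductShowcaseInput=" + quote(payload, safe="{}:,")


# Assumption: three pages exist; the real page count is not known here.
page_urls = [build_page_url(n) for n in range(1, 4)]
```

The resulting list could be assigned to start_urls in the spider, so that parse is called once per page.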

Hope that helps.

Community
alecxe
  • Running the code with `scrapy runspider spider.py -o output.json` gives me the following error: Usage ===== scrapy runspider [options] runspider: error: no such option:-o. What is the reason for this error? – user1735576 May 28 '13 at 09:27
  • Check if you have the latest (0.17) scrapy version. – alecxe May 28 '13 at 09:30
  • No. Running `pip install --upgrade Scrapy` gives me an error: OSError: [Errno 13] Permission denied: '/usr/share/pyshared/Scrapy-0.12.0.2542.egg-info' – user1735576 May 28 '13 at 12:49
  • Looks like you should run `pip install --upgrade scrapy` with `sudo`. And, yeah, you have 0.12 - it's rather old. – alecxe May 28 '13 at 12:54
  • The last site used a GET request, so we were able to reproduce the XHR request, whereas sites like http://www.univercell.in/buy/SMART use a POST request, where the request URL has no arguments. The request is as follows: http://www.univercell.in/control/AjaxCategoryDetail?productCategoryId=PRO-SMART&category_id=PRO-SMART&attrName=&min=&max=&sortSearchPrice=&VIEW_INDEX=1&VIEW_SIZE=15&serachupload=&sortupload= . Can you please suggest a method to get the prices from this site? – user1735576 May 28 '13 at 14:15