
I have written a spider to scrape a few elements from a website, but the problem is that I am unable to fetch some of the elements, while others work fine. Please point me in the right direction.

Here is my spider code:

from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ScrapyScraper.items import ScrapyscraperItem

class ScrapyscraperSpider(CrawlSpider) :
    name = "rs"
    allowed_domains = ["mega.pk"]
    start_urls = ["http://www.mega.pk/mobiles/"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ("http://www\.mega\.pk/mobiles_products/[0-9]+\/[a-zA-Z-0-9.]+",)), callback = 'parse_item', follow = True),
    )

    def parse_item(self, response) :
        sel = Selector(response)
        item = ScrapyscraperItem()

        item['Heading'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[1]/h2/span/text()').extract()
        item['Content'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/p/text()').extract()
        item['Price'] = sel.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[2]/div[1]/div[2]/span/text()').extract()
        item['WiFi'] = sel.xpath('//*[@id="laptop_detail"]/tbody/tr/td[contains(. ,"Wireless")]/text()').extract()

        return item

Now I am able to get Heading, Content and Price, but WiFi returns nothing. What confuses me completely is that the same XPath works in Chrome but not in Python (Scrapy).

Mansoor Akram

1 Answer

I'm still learning myself, but I think I see your problem.

I imagine you are looking for the wifi status, in which case you need the text of the span inside the next element:

import urllib2
import lxml.html as LH

url = 'http://www.mega.pk/laptop_products/13242/Apple-MacBook-Pro-with-Retina-Display-Z0RG0000V.html'
response = urllib2.urlopen(url)
html = response.read()
doc = LH.fromstring(html)

heading = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[1]/h2/span/text()')
content = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/p/text()')
price = doc.xpath('//*[@id="main1"]/div[1]/div[1]/div/div[2]/div[2]/div/div[2]/div[1]/div[2]/span/text()')

# Find the <td> that mentions "Wireless", then read the <span> inside
# the <td> that immediately follows it.
wifi_location = doc.xpath('//*[@id="laptop_detail"]//tr/td[contains(. ,"Wireless")]')[0]
wifi_status = wifi_location.getnext().find('span').text

I only checked a single page, but hopefully this helps. I am not sure why your XPath does not work; I will do more reading, but I often find that including tbody does not behave properly in this setting. I typically skip straight to td via //.
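The getnext()/find('span') navigation above can also be expressed as a single XPath with the following-sibling axis, which would let you keep everything inside a Scrapy selector. A minimal offline sketch, using a hypothetical table whose shape is only assumed from the navigation above:

```python
import lxml.html as LH

# Hypothetical stand-in for the spec table; the real page's markup may
# differ, but note there is no <tbody> in the source HTML.
html = """
<table id="laptop_detail">
  <tr><td>Wireless</td><td><span>802.11ac</span></td></tr>
</table>
"""
doc = LH.fromstring(html)

# Same navigation as getnext().find('span'), in one XPath expression:
# find the "Wireless" <td>, step to the next <td>, read its <span>.
wifi = doc.xpath('//*[@id="laptop_detail"]//tr'
                 '/td[contains(., "Wireless")]'
                 '/following-sibling::td[1]/span/text()')
```

The same expression should work in `sel.xpath(...)` in the spider, since Scrapy's selectors are built on lxml.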

Edit

Found the reason: Chrome inserts tbody into tables when it is not present in the original HTML, so the path you copy from DevTools includes it. Scrapy parses the original HTML, which has no tbody, so an XPath containing /tbody/ matches nothing.
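You can reproduce the effect offline. A minimal sketch with a made-up table, assuming only that the served HTML omits tbody (lxml's HTML parser, which Scrapy also uses, does not insert one):

```python
import lxml.html as LH

# Hypothetical markup as the server would send it: no <tbody>.
html = """
<table id="laptop_detail">
  <tr><td>Wireless</td><td><span>Yes</span></td></tr>
</table>
"""
doc = LH.fromstring(html)

# The path copied from Chrome includes the browser-inserted tbody
# and matches nothing against the real markup:
with_tbody = doc.xpath('//*[@id="laptop_detail"]/tbody/tr/td')

# Skipping to tr with // matches whether or not a tbody is present:
without_tbody = doc.xpath('//*[@id="laptop_detail"]//tr/td')
```

This is why replacing /tbody/tr with //tr in the spider's WiFi XPath should make it match.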

Extracting lxml xpath for html table

ryanmc