XPath selector in Python Scrapy

Question

I am learning how to use XPath with Python Scrapy to scrape websites, and I am stuck on the following:

I am looking at a Dutch website, http://www.ah.nl/producten/bakkerij/brood, where I want to scrape the names of the products.

Eventually I want a CSV file with the names of all these breads. If I inspect the elements, I can see where these names are defined.

I need to find the right XPath to extract "AH Tijgerbrood bruin heel". So what I thought I should do in my spider is the following:

import scrapy
from stack.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "ah"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']
    def parse(self, response):
        for sel in response.xpath('//div[@class="product__description small-7 medium-12"]'):
            item = DmozItem()
            item['title'] = sel.xpath('h1/text()').extract()
            yield item

Now, if I crawl with this spider, I don't get any results. I have no clue what I am missing here.

It is highly likely that the content you are trying to scrape is not actually available when you fetch the URL. It is probably populated using JavaScript after the page is loaded. Try fetching `http://www.ah.nl/producten/bakkerij/brood` with `curl` and examining the resulting document. — larsks, Aug 06 '15 at 13:09
@heinst: yes, I know, I looked before posting that comment :). — larsks, Aug 06 '15 at 13:25
@larsks Does this mean that there is no way to scrape the names of the articles? — Badshah, Aug 06 '15 at 13:25
http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python has some suggestions. — larsks, Aug 06 '15 at 13:26
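
The check larsks suggests (look at the raw document instead of the browser's rendered DOM) can also be done from Python. This is only a minimal sketch using the standard library: `ProductNameParser`, `product_names`, and both sample strings are made up for illustration, and the class value is copied from the XPath in the question. Unlike `h1/text()`, it accepts an `<h1>` at any depth inside the matching `<div>`.

```python
from html.parser import HTMLParser

# Class attribute copied from the XPath in the question.
TARGET_CLASS = "product__description small-7 medium-12"


class ProductNameParser(HTMLParser):
    """Collect the text of every <h1> nested in a <div> with TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while inside a matching <div>
        self.in_h1 = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                # Nested <div> inside a matching one: track depth.
                self.div_depth += 1
            elif dict(attrs).get("class") == TARGET_CLASS:
                self.div_depth = 1
        elif tag == "h1" and self.div_depth > 0:
            self.in_h1 = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.names[-1] += data


def product_names(html):
    """Return the product names found in a raw HTML string."""
    parser = ProductNameParser()
    parser.feed(html)
    parser.close()
    return [name.strip() for name in parser.names if name.strip()]


# Hypothetical samples: what a browser renders vs. what a plain fetch
# might return when the page is filled in by JavaScript.
rendered = ('<div class="product__description small-7 medium-12">'
            '<h1>AH Tijgerbrood bruin heel</h1></div>')
unrendered = '<div id="app"></div><script src="app.js"></script>'

print(product_names(rendered))
print(product_names(unrendered))
```

If `product_names` returns an empty list for the HTML that a plain fetch gives you (the output of `curl`, or `urllib.request.urlopen(url).read().decode()`), the names are most likely inserted by JavaScript, and Scrapy's XPath will see nothing either.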

score 1 · Answer 1 · answered Aug 06 '15 at 23:39

You would have to use Selenium for this task, since all the elements are loaded by JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.ah.nl/producten/bakkerij/brood")
# Implicit wait: element lookups keep polling for up to 40 seconds so the
# page's JavaScript has time to render. The number is arbitrarily large;
# you can tone it down.
driver.implicitly_wait(40)
elements = driver.find_elements(By.XPATH, '//*[local-name()="div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print(elem.text)
driver.quit()
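
Since the question asks for a CSV file in the end, the collected names could be written out with the standard `csv` module rather than printed. A minimal sketch, assuming the names are already in a list of strings; `write_names_csv` is a hypothetical helper, and the `title` header just mirrors the item field used in the question.

```python
import csv


def write_names_csv(names, path):
    """Write one product name per row under a single 'title' header."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title"])
        for name in names:
            writer.writerow([name])
```

With the Selenium loop above, this could be called as `write_names_csv([elem.text for elem in elements], "breads.csv")`, where the file name is again just an example.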

score 0 · Answer 2 · answered Feb 08 '16 at 12:30

title = response.xpath('//div[@class="product__description small-7 medium-12"]/h1/text()').extract()[0]

Aswin Sathyan
