I am learning how to use XPath to scrape websites with Python and Scrapy, and I am stuck on the following:

I am looking at the Dutch website http://www.ah.nl/producten/bakkerij/brood, where I want to scrape the names of the products.

Eventually I want a CSV file with the names of all these breads. Inspecting the page shows the element where each name is defined.

I need to find the right XPath to extract "AH Tijgerbrood bruin heel". This is what I tried in my spider:

import scrapy
from stack.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "ah"
    allowed_domains = ["ah.nl"]
    start_urls = ['http://www.ah.nl/producten/bakkerij/brood']
    def parse(self, response):
        for sel in response.xpath('//div[@class="product__description small-7 medium-12"]'):
            item = DmozItem()
            item['title'] = sel.xpath('h1/text()').extract()
            yield item

If I crawl with this spider, I don't get any results. I have no clue what I am missing here.

Badshah
  • It is highly likely that the content you are trying to scrape is not actually available when you fetch the URL. It is probably populated using JavaScript after the page is loaded. Try fetching `http://www.ah.nl/producten/bakkerij/brood` with `curl` and examining the resulting document. – larsks Aug 06 '15 at 13:09
  • @larsks That's exactly what is happening. – heinst Aug 06 '15 at 13:23
  • @heinst: yes, I know, I looked before posting that comment :). – larsks Aug 06 '15 at 13:25
  • @larsks Does this mean that there is no way to scrape the names of the articles? – Badshah Aug 06 '15 at 13:25
  • http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python has some suggestions. – larsks Aug 06 '15 at 13:26
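
The check suggested in the comments above can also be done from Python instead of `curl`. This is only a minimal sketch: the helper names `fetch_raw_html` and `marker_in_html` are made up for illustration, and the marker is simply the first class name from the question's XPath.

```python
import urllib.request

# Class name taken from the XPath in the question.
MARKER = "product__description"


def fetch_raw_html(url, timeout=10):
    """Download the page the way curl would: no JavaScript is executed."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def marker_in_html(html, marker=MARKER):
    """Return True if the marker text occurs in the raw HTML source."""
    return marker in html
```

Calling `marker_in_html(fetch_raw_html("http://www.ah.nl/producten/bakkerij/brood"))` and getting `False` would suggest the product markup is added after the initial response, in which case no XPath run against that response can match it.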

2 Answers

You would have to use Selenium for this task, since all the elements are loaded by JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# Implicit wait: element lookups keep retrying for up to 40 seconds, which
# gives the page time to render. The number is arbitrarily large; tone it down.
driver.implicitly_wait(40)
driver.get("http://www.ah.nl/producten/bakkerij/brood")
elements = driver.find_elements(By.XPATH, '//*[local-name()= "div" and @class="product__description small-7 medium-12"]//*[local-name()="h1"]')
for elem in elements:
    print(elem.text)
driver.quit()
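
Since the question asks for a CSV file, here is a rough sketch of one way to extend the idea above. It swaps the implicit wait for an explicit `WebDriverWait`, and it assumes the class names from the question still match the page. The function names `collect_titles` and `write_titles` are hypothetical, chosen only for this example.

```python
import csv

# XPath from the question, narrowed to the h1 that holds the product name.
TITLE_XPATH = '//div[@class="product__description small-7 medium-12"]//h1'


def collect_titles(url, timeout=40):
    """Open the page in Chrome and return the text of every matching h1."""
    # Imported here so the CSV helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one title element exists; a TimeoutException
        # is raised if none appears within `timeout` seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, TITLE_XPATH))
        )
        return [elem.text for elem in elements]
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def write_titles(titles, fileobj):
    """Write one product name per row under a 'title' header."""
    writer = csv.writer(fileobj)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])
```

To use it, open an output file with `open("brood.csv", "w", newline="")` (the filename is just an example) and pass it to `write_titles` together with the list returned by `collect_titles`.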
heinst

title = response.xpath('//div[@class="product__description small-7 medium-12"]/h1/text()').extract()[0]
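
An expression of this shape only matches when the `div` and `h1` are present in the HTML being parsed. As a small offline illustration, the sketch below runs an equivalent path against hand-written sample markup. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy's selectors, so it needs well-formed input and has no `text()` step. The sample markup and the name `extract_titles` are assumptions made for this example, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written markup modelled on the question's description.
SAMPLE = """
<html><body>
  <div class="product__description small-7 medium-12">
    <h1>AH Tijgerbrood bruin heel</h1>
  </div>
</body></html>
"""

# Same path as the expression above, minus text(), which ElementTree does
# not support; the text is read from the matched element instead.
PATH = './/div[@class="product__description small-7 medium-12"]/h1'


def extract_titles(markup):
    """Return the stripped text of every h1 directly under a matching div."""
    root = ET.fromstring(markup)
    return [(h1.text or "").strip() for h1 in root.findall(PATH)]
```

With `SAMPLE` the function finds the product name, while markup that lacks the `div` yields an empty list, which is what the spider in the question sees if the products are rendered by JavaScript.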

Aswin Sathyan