Using Scrapy to retrieve nested data

Question

I'm using Scrapy to try to retrieve URL data in a nested class. I've tried following the tutorials and similar questions, but I'm coming up short in my seemingly simple task.

The page I'm trying to scrape is this one: http://www.leasingcar.dk/privatleasing

For every vehicle on the page I want an XPath that leads to the "data-nice_url" text. The first result should therefore be "/privatleasing/Citro%c3%abn-Berlingo/eHDi-90-Seduction-E6G". But I get an empty data set every time. I've tried varying the XPath without any luck.

My code looks like this:

from scrapy.spiders import Spider
from stack.items import StackItem
from scrapy.selector import Selector


class LeasingCarSpider(Spider):
    name = "leasingcar"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["www.leasingcar.dk"]
    start_urls = ["http://www.leasingcar.dk/privatleasing"]

    def parse(self, response):
        hxs = Selector(response)
        # This is the expression that comes back empty
        print(hxs.xpath('//div[@class="data-nice_url"]/text()').extract())

Thanks in advance

score 0 · Accepted Answer · edited May 23 '17 at 12:10


The page is very "dynamic" and uses multiple XHR requests to different API endpoints to construct itself. After looking at these requests in browser developer tools, I would say it won't be easy for you to simulate them in your Scrapy code and it would be much easier to approach the problem with selenium - a browser automation tool. You can also use a "headless" PhantomJS browser or a virtual display.
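The browser-automation approach can be sketched roughly as follows. This is a minimal sketch, not a tested solution for this site: it assumes that "data-nice_url" is an HTML attribute on each vehicle element, that Selenium 4 with Firefox and its driver is installed, and the function names `nice_urls_from_elements` and `scrape_nice_urls` are made up for illustration.

```python
def nice_urls_from_elements(elements):
    """Return the non-empty data-nice_url attribute value of each element."""
    urls = []
    for element in elements:
        value = element.get_attribute("data-nice_url")
        if value:
            urls.append(value)
    return urls


def scrape_nice_urls(url, timeout=30):
    """Open the page in a real browser and read the attribute after rendering."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # Wait until at least one element carrying the attribute is in the DOM.
        # Assumption: the vehicles are rendered as elements with a data-nice_url attribute.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-nice_url]"))
        )
        return nice_urls_from_elements(elements)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `scrape_nice_urls("http://www.leasingcar.dk/privatleasing")` would then return a list of attribute values, provided the page still renders them the way the question describes.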

In any case, make sure you are not violating the Terms of Use of the website and you are a good web-scraping citizen.


answered Jul 26 '15 at 01:51

alecxe


Hi! Would you mind helping a bit? http://stackoverflow.com/questions/31630771/scrapy-linkextractor-duplicating – yukclam9 Jul 26 '15 at 03:40
Thanks for your answer @alecxe. I've already tried looking at the API, but when I send the required URL in a POST request I get "access denied". Could you elaborate on why the site being dynamic is hindering me from achieving my goal? I can see the data I want in the HTML, so I assume it must be visible to Scrapy? – Frank Jul 26 '15 at 08:33
@Frank that's the point: Scrapy is not a browser and "sees" only the initial HTML, which does not contain the desired data. The data is loaded dynamically via additional requests and JavaScript executed in the browser. That's why I'm proposing a high-level approach: automate a real browser, which would do all the work and construct the page, and then grab the results. – alecxe Jul 26 '15 at 14:41
@alecxe I've looked into Selenium like you suggested. I can get WebDriver to open the page, but I'm still having problems finding the property I want in the block. I've downloaded Firebug and FirePath but can't get the XPath to reach the data that I want. I've tried: htmlElement = WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//div[@class="car-thumb-item clickable vehicle " and contains(text(), "privatleasing")]')) But it returns an arbitrary number of elements each time? – Frank Jul 27 '15 at 14:09
* and I only want the "data-nice_url" text – Frank Jul 27 '15 at 14:18
@Frank let's not solve new issues in comments. Could you please make a separate question? Thanks. – alecxe Jul 27 '15 at 19:12
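The point made in the comments, that Scrapy only receives the initial HTML, can be checked offline by looking for the attribute in the raw response body. The sketch below uses only the standard library and assumes "data-nice_url" is an attribute rather than a class value. The two HTML strings are hypothetical stand-ins for the initial and the browser-rendered markup, and `AttributeCollector` and `attribute_values` are illustrative names. Under the same assumption, the XPath in the question tests for a class named "data-nice_url", whereas an attribute would be selected in Scrapy with `response.xpath('//@data-nice_url').extract()`.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the values of one attribute from every tag that carries it."""

    def __init__(self, attribute):
        super().__init__()
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == self.attribute and value is not None:
                self.values.append(value)


def attribute_values(html, attribute="data-nice_url"):
    """Return every value of the given attribute found in the HTML text."""
    parser = AttributeCollector(attribute)
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical markup: what a browser might show after JavaScript has run...
RENDERED_HTML = (
    '<div class="car-thumb-item clickable vehicle " '
    'data-nice_url="/privatleasing/Citro%c3%abn-Berlingo/eHDi-90-Seduction-E6G"></div>'
)
# ...versus an initial response that has not been populated yet.
INITIAL_HTML = '<div id="vehicles"></div>'

print(attribute_values(RENDERED_HTML))
print(attribute_values(INITIAL_HTML))
```

If running `attribute_values(response.text)` inside the spider's `parse` method yields an empty list, the data is not in the HTML that Scrapy downloaded, and no XPath will find it there.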
