
I would like to get the link in the href attribute of an `a` element. The URL is: https://www.drivy.com/location-voiture/antwerpen/bmw-serie-1-477429?address=Gare+d%27Anvers-Central&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-20&end_time=18%3A30&latitude=51.2162&longitude=4.4209&start_date=2019-05-20&start_time=06%3A00

I'm searching for the href of this element:

<a class="car_owner_section" href="/users/2643273" rel="nofollow"></a>

When I enter `response.css('a.car_owner_section::attr(href)').get()` in the Scrapy shell, I get nothing back, even though the element exists when I inspect `view(response)`.

Does anybody have a clue about this issue?

M. Coppée
  • Don't trust `view(response)`: as soon as you open the page in a browser, JavaScript code can alter the DOM. Check the actual page source (Ctrl+U in Firefox, or write `response.body` to a file; see the sketch after these comments). See also https://stackoverflow.com/q/8550114/939364 – Gallaecio May 10 '19 at 09:48
  • Thanks for the advice. I already know that some data is rendered with JS, etc. But here I just cannot figure out where the link has gone; it's not in any JSON file I've found. – M. Coppée May 10 '19 at 10:32
  • It may be in the page but in a different place, in HTML. Check the sources as suggested. – Gallaecio May 10 '19 at 10:38
  • Already done. I've checked the full HTML source; I may have missed it, but I don't think so. The link is certainly rendered by JS. – M. Coppée May 10 '19 at 10:51
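
As an aside, here is a minimal sketch of the `response.body`-to-file check Gallaecio suggests, run from the Scrapy shell (`scrapy shell <url>`):

    # Dump the raw HTML Scrapy actually received, so it can be searched
    # without any browser-side JavaScript having modified the DOM.
    with open('page_source.html', 'wb') as f:  # 'wb' because response.body is bytes
        f.write(response.body)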

1 Answer


The site loads its content with JavaScript; using Splash works perfectly.

Here is the code:

import scrapy
from scrapy_splash import SplashRequest


class ScrapyOverflow1(scrapy.Spider):
    name = "overflow1"

    def start_requests(self):
        url = 'https://www.drivy.com/location-voiture/antwerpen/bmw-serie-1-477429?address=Gare+d%27Anvers-Central&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-20&end_time=18%3A30&latitude=51.2162&longitude=4.4209&start_date=2019-05-20&start_time=06%3A00'

        # Render the page in Splash and wait 5 seconds so the JavaScript
        # that inserts the owner link has time to run.
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        # The link is present in the Splash-rendered HTML, so a plain XPath works.
        links = response.xpath('//a[@class="car_owner_section"]/@href').extract()
        print(links)

To use Splash, install Splash and scrapy-splash, and run `sudo docker run -p 8050:8050 scrapinghub/splash` before running the spider. There is a good article on installing and running Splash that also covers adding the required middlewares to settings.py. The expected result is as above.
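
For reference, the settings.py additions documented in the scrapy-splash README look roughly like this (assuming Splash is listening on localhost:8050):

    # settings.py — scrapy-splash wiring, per the scrapy-splash README
    SPLASH_URL = 'http://localhost:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

With those in place, run the spider as usual with `scrapy crawl overflow1`.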

Muhika Thomas
  • Thanks a lot for the help! I'm using Scrapy with Anaconda; will this installation be compatible with it? – M. Coppée May 10 '19 at 11:22
  • Awesome, glad I could help. Yes, it is compatible. Let me know if you encounter an issue. – Muhika Thomas May 10 '19 at 11:27
  • While Splash is an option, it will have a significant performance hit on any crawling session. In the long term it's better to figure out where the content is coming from and extract it directly (a hypothetical sketch of that approach follows). – Gallaecio May 10 '19 at 12:00
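
A minimal sketch of what Gallaecio describes, assuming the owner link turns out to come from an XHR call you can find in the browser's Network tab. The endpoint URL and JSON keys below are made up for illustration; substitute whatever the real request and payload contain:

    import scrapy


    class OwnerApiSpider(scrapy.Spider):
        name = "owner_api"
        # Hypothetical JSON endpoint; find the real one in the Network tab.
        start_urls = ["https://www.drivy.com/api/cars/477429"]

        def parse(self, response):
            data = response.json()  # requires Scrapy >= 2.2
            # Hypothetical keys; adjust to the real payload structure.
            yield {"owner_href": data.get("owner", {}).get("href")}

Requesting the data source directly avoids the per-request rendering cost of Splash.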