Python web-scraping with changing href

Question

I have been scraping some websites using Python 2.7:

    import requests
    from lxml import html  # assumed: html.fromstring() and .xpath() match lxml's API

    page = requests.get(URL)
    tree = html.fromstring(page.content)

    prices = tree.xpath('//span[@class="product-price"]/text()')
    titles = tree.xpath('//span[@class="product-title"]/text()')

This works fine for websites that have these clear tags in them, but a lot of the websites I encounter have the following HTML setup:

<a href="https://www.retronintendokopen.nl/gameboy/games/gameboy-classic/populous" class="product-name"><strong>Populous</strong></a>

(I am trying to extract the title: Populous.) The href changes for every title I am extracting. For the example above I tried the following, hoping that matching on the class would be enough, but it doesn't work:

titles = tree.xpath('//a[@class="product-name"]/text()')

I was searching for a character that would work like *, as in "I don't care what's in here, just take everything with an href=...", but couldn't find anything:

titles = tree.xpath('//a[@href="*"]/text()')

Also, would I need to specify that there is also a class= in the a tag, like this?

titles = tree.xpath('//a[@href="*" @class="product-name"]/text()')

EDIT: I also found a workaround for cases where only the attributes of the a tag change, by reading its title attribute:

titles = tree.xpath('//h3/a/@title')

Example markup for this case:

<h3><a href="http://www.a-retrogame.nl/index.php?id_product=5843&amp;controller=product&amp;id_lang=7" title="4 in 1 fun pack">4 in 1 fun pack</a></h3>

http://stackoverflow.com/questions/3737906/xpath-how-to-check-if-an-attribute-exists and http://stackoverflow.com/questions/10247978/xpath-with-multiple-conditions — Ilja Everilä, Apr 25 '17 at 10:59
@nishantkumar no! beautifulsoup is not an ideal solution for scraping. `xpaths` are!. — anekix, Apr 25 '17 at 11:02
try scrapy. also, in xpath //a[@href] is used to prove existence — Andrew Scott Evans, Feb 08 '21 at 02:58
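The two linked answers and the `//a[@href]` hint above can be sketched as follows. This is a minimal illustration assuming lxml; the sample markup, the `/games/populous` and `/about` paths, and the variable names are made up for the example and are not taken from the sites in the question.

```python
from lxml import html

# Hypothetical markup: one product link, one product anchor without href,
# and one unrelated navigation link.
SNIPPET = (
    '<div>'
    '<a href="/games/populous" class="product-name"><strong>Populous</strong></a>'
    '<a class="product-name"><strong>No link</strong></a>'
    '<a href="/about" class="nav">About</a>'
    '</div>'
)
tree = html.fromstring(SNIPPET)

# [@href] matches any <a> that has an href attribute, whatever its value.
# XPath 1.0 has no "*" wildcard for attribute values; testing existence is enough.
with_href = tree.xpath('//a[@href]')

# Several conditions go inside one predicate, joined with "and".
products = tree.xpath('//a[@href and @class="product-name"]')

hrefs = [a.get('href') for a in products]
print(len(with_href))
print(hrefs)
```

Because the class alone already identifies the product links in the question's markup, the `@href` test is optional there; it only matters when some matching anchors lack an href.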

score 1 · Accepted Answer · answered Apr 25 '17 at 11:02

Try this:

titles = tree.xpath('//a[@class="product-name"]//text()')

Note the `//` after the class predicate: it selects text nodes at any depth below the a element, not only its direct children.
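To see the difference concretely, here is a small sketch run against the anchor from the question. It assumes lxml; the wrapping `<div>` and the variable names are added for illustration only, and the `string(.)` variant is an alternative I am suggesting, not part of the original answer.

```python
from lxml import html

# Anchor copied from the question; the title text sits inside <strong>,
# not directly inside <a>.
SNIPPET = (
    '<div>'
    '<a href="https://www.retronintendokopen.nl/gameboy/games/gameboy-classic/populous" '
    'class="product-name"><strong>Populous</strong></a>'
    '</div>'
)
tree = html.fromstring(SNIPPET)

# /text() selects only text nodes that are direct children of <a>.
direct = tree.xpath('//a[@class="product-name"]/text()')

# //text() selects text nodes at any depth below <a>, including inside <strong>.
nested = tree.xpath('//a[@class="product-name"]//text()')

# string(.) joins all descendant text of each matched element into one string,
# which keeps one title per link even if the text is split across several tags.
titles = [a.xpath('string(.)').strip()
          for a in tree.xpath('//a[@class="product-name"]')]

print(direct)
print(nested)
print(titles)
```

The first query returns an empty list, while the other two each return a list containing `Populous`.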

anekix

That was too easy haha, could you explain what the double slash actually does and why the single one didn't work? – Alex Apr 25 '17 at 11:24
@Alex the double `//` selects any descendant, not only direct children. In your markup there is a `<strong>` inside the `<a>`, so the text is a child of `<strong>`. `/text()` found nothing because it only looks for text that is an immediate child of `<a>`. That is why we need `//`. Hope that clears it up. – anekix Apr 25 '17 at 11:39
