
I have been scraping some websites using Python 2.7:

    import requests
    from lxml import html

    # URL is the address of the page being scraped
    page = requests.get(URL)
    tree = html.fromstring(page.content)

    prices = tree.xpath('//span[@class="product-price"]/text()')
    titles = tree.xpath('//span[@class="product-title"]/text()')

This works fine for websites that have clear tags like these, but many of the websites I encounter use the following HTML setup instead:

    <a href="https://www.retronintendokopen.nl/gameboy/games/gameboy-classic/populous" class="product-name"><strong>Populous</strong></a>

(I am trying to extract the title: Populous.) The href changes for every title I am extracting. For the example above I tried the following, hoping that matching on the class alone would be enough, but it doesn't work:

    titles = tree.xpath('//a[@class="product-name"]/text()')

I was searching for a character that would work like *, as in "I don't care what's in here, just take everything with an href=...", but I couldn't find anything:

    titles = tree.xpath('//a[@href="*"]/text()')

Also, would I need to specify that there is a class= attribute in the a tag as well, like this?

    titles = tree.xpath('//a[@href="*" @class="product-name"]/text()')

EDIT: I also found a fix for cases where the a tag carries a title attribute, by reading that attribute directly:

    titles = tree.xpath('//h3/a/@title')

An example of a tag this works for:

    <h3><a href="http://www.a-retrogame.nl/index.php?id_product=5843&amp;controller=product&amp;id_lang=7" title="4 in 1 fun pack">4 in 1 fun pack</a></h3>
Alex

1 Answer


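Try this:

    titles = tree.xpath('//a[@class="product-name"]//text()')

Notice the // after the class selector. With a single slash, text() only matches text nodes that are direct children of the a element. In your example the title sits inside a nested strong tag, so you need // to reach text nodes of descendants as well.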
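To make the difference concrete, here is a minimal sketch, assuming lxml is installed and using the markup from the question wrapped in a div. The variable names (snippet, direct, nested, with_href) are only illustrative. It also touches on the wildcard part of the question: in XPath a bare [@href] predicate tests that the attribute exists, whatever its value, and conditions can be combined with "and".

```python
from lxml import html

# Sample markup copied from the question; the wrapping <div> and the
# variable names are assumptions made for this illustration.
snippet = (
    '<div>'
    '<a href="https://www.retronintendokopen.nl/gameboy/games/gameboy-classic/populous" '
    'class="product-name"><strong>Populous</strong></a>'
    '</div>'
)
tree = html.fromstring(snippet)

# /text() returns only text nodes that are direct children of <a>.
# Here the text lives inside <strong>, so nothing matches.
direct = tree.xpath('//a[@class="product-name"]/text()')

# //text() also returns text nodes of descendants such as <strong>.
nested = tree.xpath('//a[@class="product-name"]//text()')

# [@href] checks only that the attribute is present, so no wildcard
# value is needed; "and" combines it with the class test.
with_href = tree.xpath('//a[@href and @class="product-name"]//text()')

print(direct)
print(nested)
print(with_href)
```

Since the class already identifies the link, the @href test is optional here; it is shown only because the question asked how to say "any href".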

anekix