I have been scraping some websites using Python 2.7 with requests and lxml:
import requests
from lxml import html

# URL is the address of the page being scraped
page = requests.get(URL)
tree = html.fromstring(page.content)
prices = tree.xpath('//span[@class="product-price"]/text()')
titles = tree.xpath('//span[@class="product-title"]/text()')
This works fine for websites that mark up their data with clear tags like these, but a lot of the websites I encounter use the following HTML setup instead:
<a href="https://www.retronintendokopen.nl/gameboy/games/gameboy-classic/populous" class="product-name"><strong>Populous</strong></a>
(I am trying to extract the title: Populous.) The href changes for every title I am extracting. I tried the following for the example above, hoping that matching on the class alone would be enough, but it doesn't work:
titles = tree.xpath('//a[@class="product-name"]/text()')
I was searching for a wildcard character that would work like *, as in 'I don't care what's in here, just take everything with an href=...', but I couldn't find anything:
titles = tree.xpath('//a[@href="*"]/text()')
Also, would I need to specify that the a tag also has a class attribute, like this:
titles = tree.xpath('//a[@href="*" @class="product-name"]/text()')
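For reference, here is a minimal sketch of how these attribute tests can be expressed, run against the sample markup above parsed from a string instead of a live page. It assumes lxml; the wrapping div and the variable names (snippet, direct, nested, with_href) are only for illustration. XPath has no "*" wildcard for attribute values: a bare [@href] tests that the attribute exists, and several conditions are joined with "and". Also note that /text() only returns text nodes that are direct children of the a tag, and here the text sits inside the nested strong tag.

```python
from lxml import html

# Sample markup copied from the question, wrapped in a div for illustration.
snippet = (
    '<div><a href="https://www.retronintendokopen.nl/gameboy/games/'
    'gameboy-classic/populous" class="product-name">'
    '<strong>Populous</strong></a></div>'
)
tree = html.fromstring(snippet)

# /text() looks only at direct text children of <a>; the title is inside
# <strong>, so this comes back empty.
direct = tree.xpath('//a[@class="product-name"]/text()')

# //text() also descends into child elements such as <strong>.
nested = tree.xpath('//a[@class="product-name"]//text()')

# [@href] means "has an href attribute, whatever its value";
# "and" combines it with the class test.
with_href = tree.xpath('//a[@href and @class="product-name"]/strong/text()')
```

On this snippet, direct is an empty list, while nested and with_href both contain the single string 'Populous'. Whether //text() or an explicit /strong/text() is the better fit depends on how consistent the markup is across the pages being scraped.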
EDIT: I also found a fix for cases where the a tag carries the title in an attribute, by selecting that attribute directly:
titles = tree.xpath('//h3/a/@title')
An example of a tag this works for:
<h3><a href="http://www.a-retrogame.nl/index.php?id_product=5843&controller=product&id_lang=7" title="4 in 1 fun pack">4 in 1 fun pack</a></h3>