I am struggling with the IMPORTXML function in Google Sheets. A specific webpage has this content:
<li class="myTag">
<a href="/xyz/abc/">
Jouets
<small class="hidden">
<span>235 6453 sous-section</span>
<span>1231 765 produits</span>
<span>12155 654 images</span>
</small>
</a>
</li>
<li class="myTag">
<a href="/xyz/def/">
Livres
<small class="hidden">
<span>235 6453 sous-section</span>
<span>1231 765 produits</span>
<span>12155 654 images</span>
</small>
</a>
</li>
I want to extract all li tags with class="myTag", but without the <small> section, in order to get this list:
- Jouets
- Livres
I have written this XPath expression:
=IMPORTXML(B1; '//li[@class='myTag'].ReplaceAll("<[^>]*>", "", "")')
But I get a #N/A error.
How can I extract just the first elements (Jouets, Livres) without the <small> section?
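One possible direction, offered as an untested sketch rather than a confirmed fix: .ReplaceAll is not part of XPath, and a Sheets string literal needs double quotes on the outside so that the inner single quotes can stay. Assuming the label is always the first text node inside the <a> element, as in the sample above, the expression could select that node directly instead of stripping tags afterwards:

```
=IMPORTXML(B1; "//li[@class='myTag']/a/text()[1]")
```

If the returned values carry leading or trailing whitespace from the line breaks in the markup, wrapping the call as =ARRAYFORMULA(TRIM(IMPORTXML(...))) may clean them up. The semicolon separator follows the locale used in the formula above.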
Originality of this question
The question "Scraping data to Google Sheets from a website that uses JavaScript" is about dynamic web pages, which is not my case. My page is already generated as HTML content: I can already extract the full content of the <li> tag with the parameter '//li[@class='myTag']'.
My point is to exclude the <small> content, which ends up glued to the label.
I am looking to get this list:
- Jouets
- Livres
and not this one:
- Jouets235 6453 sous-section1231 765 produits12155 654 images
- Livres235 6453 sous-section1231 765 produits12155 654 images