1

I am struggling with the ImportXML function from Google Spreadsheet. A specific webpage has this content:

<li class="myTag">
    <a href="/xyz/abc/">
        Jouets
        <small class="hidden">
            <span>235&nbsp;6453&nbsp;sous-section</span>
            <span>1231&nbsp;765&nbsp;produits</span>
            <span>12155&nbsp;654&nbsp;images</span>
        </small>
    </a>
</li>
<li class="myTag">
    <a href="/xyz/def/">
        Livres
        <small class="hidden">
            <span>235&nbsp;6453&nbsp;sous-section</span>
            <span>1231&nbsp;765&nbsp;produits</span>
            <span>12155&nbsp;654&nbsp;images</span>
        </small>
    </a>
</li>

I want to extract all li tags with the class="myTag" but without the <small> section, in order to have this list:

  • Jouets
  • Livres

I have written this XPath expression:

=IMPORTXML(B1; '//li[@class='myTag'].ReplaceAll("<[^>]*>", "", "")')

But I have the #N/A error.

How can I extract just the first elements (Jouets, Livres) without the <small> section?


Originality of this question

Scraping data to Google Sheets from a website that uses JavaScript is about dynamic web pages which is not my case. My page is already generated in HTML content - I can already extract the full content within the <li> tag with the parameter '//li[@class='myTag']'.

My point is to exclude the <small> content which is glued.
I am looking to get this list:

  • Jouets
  • Livres

and not this one:

  • Jouets235 6453 sous-section1231 765 produits12155 654 images
  • Livres235 6453 sous-section1231 765 produits12155 654 images
Rubén
  • 34,714
  • 9
  • 70
  • 166
macaaron
  • 13
  • 4
  • Can you provide the URL for testing it? – Tanaike Feb 05 '23 at 23:05
  • Yes: https://www.universalis.fr/classification/ – macaaron Feb 06 '23 at 12:01
  • Thank you for replying and providing the URL. When I saw the HTMl of your URL, I cannot find the values of `Jouets` and `Livres`. And also, I cannot find the "li" tag with the class of `myTag`. I apologize for this. – Tanaike Feb 06 '23 at 12:13
  • Sorry, I anonymised the content. Here is the full importxml function: =IMPORTXML(https://www.universalis.fr/classification/; '//li[@class='with-counter']') – macaaron Feb 06 '23 at 12:17
  • Thank you for replying. About `I anonymised the content. `, I couldn't notice it. I apologize for this. – Tanaike Feb 06 '23 at 12:17
  • Did you succeed with the test? `=IMPORTXML('https://www.universalis.fr/classification/'; '//li[@class='with-counter']')` – macaaron Feb 06 '23 at 17:20
  • Thank you for replying. Unfortunately, from your reply, I couldn't understand your expected values. But I would like to try to understand it. When I could correctly understand it, I would like to think of a solution. When you use your provided formula of `=IMPORTXML('https://www.universalis.fr/classification/'; '//li[@class='with-counter']')`, what did you obtain the values and what values do you want to retrieve? So, can I ask you about the detail of your expected value? – Tanaike Feb 07 '23 at 06:31
  • Thank you. Currently, with this formula I have this result (just pasting the first lines): - `Arts1 069 sous-thèmes15 778 articles11 949 médias` ; - `Astronomie et astrophysique107 sous-thèmes570 articles696 médias` ; - `Chimie147 sous-thèmes1 042 articles1 087 médias` But I am attempting to get only: - `Arts` ; - `Astronomie et astrophysique` ; - `Chimie` – macaaron Feb 07 '23 at 09:27
  • Thank you for replying. From your reply, I proposed an answer. Please confirm it. If I misunderstood your expected values, I apologize. – Tanaike Feb 07 '23 at 11:25

1 Answers1

0

From your following reply,

Currently, with this formula I have this result (just pasting the first lines): - Arts1 069 sous-thèmes15 778 articles11 949 médias ; - Astronomie et astrophysique107 sous-thèmes570 articles696 médias ; - Chimie147 sous-thèmes1 042 articles1 087 médias But I am attempting to get only: - Arts ; - Astronomie et astrophysique ; - Chimie

How about the following XPath?

Modified XPath:

//li[@class='with-counter']/a/text()

Modified formula:

=IMPORTXML("https://www.universalis.fr/classification/"; "//li[@class='with-counter']/a/text()")

Testing:

enter image description here

Tanaike
  • 181,128
  • 11
  • 97
  • 165
  • @macaaron Thank you for replying and testing it. I could correctly understand your question with your cooperation. Thank you, too. – Tanaike Feb 07 '23 at 12:08