I need some help with QXmlQuery and XPath. I'm trying to use this combination to extract data from several HTML documents. These documents are downloaded and then cleaned with the HTML Tidy Library.

The problem appears when I evaluate my XPath expressions. Here is a sample of the cleaned markup:

[...]
    <ul class="bullet" id="idTab2">
        <li><span>Hauteur :</span> 1127 mm</li>
        <li><span>Largeur :</span> 640 mm</li>
        <li><span>Profondeur :</span> 685 mm</li>
        <li><span>Poids :</span> 159.6 kg</li>
[...]

The cleaned markup is stored in a QString named "code":

QStringList fields, values;
QXmlQuery query;

query.setFocus(code);
query.setQuery("//*[@id=\"idTab2\"]/*/*/string()");
query.evaluateTo(&fields);

My goal is to get all the fields (Hauteur, Largeur, Profondeur, Poids, etc.) and their values (1127 mm, 640 mm, 685 mm, 159.6 kg, etc.).

Question 1

As you can see, I use the XPath //*[@id="idTab2"]/*/*/string() to retrieve the fields, because //ul[@id="idTab2"]/li/span/string() doesn't work. As soon as I specify a tag name, the query returns nothing; it only works with *. Why? I've checked the markup returned by the tidy function, and the structure the XPath relies on is not altered, so I don't see any problem. Is this normal, or is there something I'm missing?

Question 2

In the previous XHTML code, each li tag wraps a span tag and some text. I don't know how to get only the text and not the content of the span tag. I tried:

//*[@id="idTab2"]/*/string() gives: Hauteur : 1127 mm Largeur : 640 mm Profondeur : 685 mm

//*[@id="idTab2"]/*[2]/string() gives: nothing

So, unless I'm mistaken, the text inside the li tag is not being treated as a child node, although it should be. See the accepted answer to "Select just text directly in node, not in child nodes".

Thanks for reading, I hope someone can help me.

Pwet

1 Answer

To get the elements (not the text representation) inside the different <li>s, you can test the text content of the <span>:

//*[@id="idTab2"]/li[starts-with(span, "Hauteur")]

The same works for the other items:

//*[@id="idTab2"]/li[starts-with(span, "Largeur")]
//*[@id="idTab2"]/li[starts-with(span, "Profondeur")]
//*[@id="idTab2"]/li[starts-with(span, "Poids")]

To get the string representation of one of these <li> elements, you can wrap the whole expression in string(), like this:

string(//*[@id="idTab2"]/li[starts-with(span, "Poids")])

which gives "Poids : 159.6 kg"
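
For readers who want to check the string-value behaviour outside an XPath engine, here is a minimal Python sketch using only the standard library. It is an illustration, not what was run for this answer: FRAGMENT is the question's markup with a closing </ul> added so that it parses, and string_value is a hypothetical helper that mimics string() on an element by concatenating all descendant text.

```python
import xml.etree.ElementTree as ET

# The question's markup, with a closing </ul> added (assumption) so it parses.
FRAGMENT = """
<ul class="bullet" id="idTab2">
    <li><span>Hauteur :</span> 1127 mm</li>
    <li><span>Largeur :</span> 640 mm</li>
    <li><span>Profondeur :</span> 685 mm</li>
    <li><span>Poids :</span> 159.6 kg</li>
</ul>
"""


def string_value(element):
    """Concatenate all descendant text, like XPath string() on an element."""
    return "".join(element.itertext())


root = ET.fromstring(FRAGMENT)

# Equivalent of li[starts-with(span, "Poids")]: first <li> whose <span> starts with "Poids".
poids = next(
    li for li in root.findall("li")
    if (li.findtext("span") or "").startswith("Poids")
)

print(string_value(poids))
```

The string value includes the <span> content because string() walks every descendant text node, not only the direct children.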

To extract only the text node in the <li>, without the <span>, you can use the expressions below. They select the text nodes that are direct children of <li> (<span> is an element, not a text node), and normalize-space() strips the leading and trailing whitespace:

normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Hauteur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Largeur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Profondeur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Poids")]/text())

The last one gives "159.6 kg"
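
The same idea, direct text children only, can be sketched in Python with the standard library. This is a rough equivalent rather than the expressions above: xml.etree.ElementTree has no text() step, so the text that follows the <span> is read from the span's tail attribute. FRAGMENT (the question's markup with a closing </ul> added) and the direct_text helper are illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET

# The question's markup, with a closing </ul> added (assumption) so it parses.
FRAGMENT = """
<ul class="bullet" id="idTab2">
    <li><span>Hauteur :</span> 1127 mm</li>
    <li><span>Largeur :</span> 640 mm</li>
    <li><span>Profondeur :</span> 685 mm</li>
    <li><span>Poids :</span> 159.6 kg</li>
</ul>
"""


def direct_text(element):
    """Text that is a direct child of element, whitespace-normalized.

    element.text is the text before the first child; each child's tail is
    the text between that child and the next sibling (or the closing tag).
    """
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return " ".join("".join(parts).split())


root = ET.fromstring(FRAGMENT)

# Pair each label (the <span> text without its trailing " :") with its value.
fields = {}
for li in root.findall("li"):
    label = li.findtext("span", default="").rstrip(" :")
    fields[label] = direct_text(li)

print(fields)
```

Splitting on whitespace and re-joining plays the role of normalize-space() here; the <span> content never enters the value because only text and tail are read, never the child's own text.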

paul trmbrth
  • Hello @paul t., thank you for the answer, but it doesn't seem to work either. I guess the problem comes from Qt, which doesn't seem to evaluate the XPath expressions as expected. I don't understand why... I'll look into it further. – Pwet Aug 26 '13 at 10:51
  • OK. Fyi, I tested those using `lxml` – paul trmbrth Aug 26 '13 at 12:47
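
One thing that may be worth checking, given that the expressions work in one tool and not in the other: whether the document is in a namespace. This is an assumption, since the thread never shows the root element, but when HTML Tidy is asked for XHTML output the root normally carries xmlns="http://www.w3.org/1999/xhtml". In that case an unprefixed name test such as ul or li matches nothing, while * still matches, which is the symptom described in Question 1. The Python sketch below reproduces that behaviour with the standard library; DOC is a made-up, shortened document and the prefix x is arbitrary.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Assumption: the cleaned document is XHTML with a default namespace on the root.
DOC = f"""<html xmlns="{XHTML_NS}"><body>
<ul class="bullet" id="idTab2">
    <li><span>Hauteur :</span> 1127 mm</li>
    <li><span>Poids :</span> 159.6 kg</li>
</ul>
</body></html>"""

root = ET.fromstring(DOC)
ns = {"x": XHTML_NS}  # arbitrary prefix bound to the XHTML namespace

# Plain tag names only match elements that are in no namespace.
unqualified = root.findall(".//ul[@id='idTab2']/li/span")

# The wildcard matches elements whatever their namespace.
wildcard = root.findall(".//*[@id='idTab2']/*/*")

# Namespace-qualified names match the XHTML elements.
qualified = root.findall(".//x:ul[@id='idTab2']/x:li/x:span", ns)

print(len(unqualified), len(wildcard), len(qualified))
print([span.text for span in qualified])
```

If this turns out to be the cause, the usual remedy on the XQuery side is to put the prolog declare default element namespace "http://www.w3.org/1999/xhtml"; in front of the path, so that unprefixed names refer to XHTML elements. That has not been tested against QXmlQuery here.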