I am trying to find the minimum value of a certain element in an XML document (it is actually an HTML table that has been converted to XML). However, the query does not work as intended.

The query is similar to the one used in "How can I use XPath to find the minimum value of an attribute in a set of elements?". It looks like this:

/table[@id="search-result-0"]/tbody/tr[
    not(substring-before(td[1], " ") > substring-before(../tr/td[1], " "))
]

Executed on the example XML

<table class="tablesorter" id="search-result-0">
    <thead>
        <tr>
            <th class="header headerSortDown">Preis</th>
            <th class="header headerSortDown">Zustand</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td width="45px">15 CHF</td>
            <td width="175px">Ausgepack und doch nie gebraucht</td>
        </tr>
        <tr>
            <td width="45px">20 CHF</td>
            <td width="175px">Ausgepack und doch nie gebraucht</td>
        </tr>
        <tr>
            <td width="45px">25 CHF</td>
            <td width="175px">Ausgepack und doch nie gebraucht</td>
        </tr>
        <tr>
            <td width="45px">35 CHF</td>
            <td width="175px">Ausgepack und doch nie gebraucht</td>
        </tr>
        <tr>
            <td width="45px">14 CHF</td>
            <td width="175px">Gebraucht, aber noch in Ordnung</td>
        </tr>
        <tr>
            <td width="45px">15 CHF</td>
            <td width="175px">Gebraucht, aber noch in Ordnung</td>
        </tr>
        <tr>
            <td width="45px">15 CHF</td>
            <td width="175px">Gebraucht, aber noch in Ordnung</td>
        </tr>
    </tbody>
</table>

the query returns the following result:

<tr>
<td width="45px">15 CHF</td>
<td width="175px">Ausgepack und doch nie gebraucht</td>
</tr>
-----------------------
<tr>
<td width="45px">14 CHF</td>
<td width="175px">Gebraucht, aber noch in Ordnung</td>
</tr>
-----------------------
<tr>
<td width="45px">15 CHF</td>
<td width="175px">Gebraucht, aber noch in Ordnung</td>
</tr>
-----------------------
<tr>
<td width="45px">15 CHF</td>
<td width="175px">Gebraucht, aber noch in Ordnung</td>
</tr>

Why is more than one node returned? Exactly one node should be returned, as there is only a single minimum: the node containing 14 CHF. Does anybody see what's wrong with the query?

Results obtained using http://xpath.online-toolz.com/tools/xpath-editor.php

str

3 Answers

TML has already pointed out why your current path expression does not work, but has not suggested a working alternative.

The reason is simple, as @Tomalak has said:

I agree with Mathias. This actually is impossible in XPath 1.0 without changing the input XML.

I add this answer to elaborate on the way you'd have to preprocess your XML before searching for the minimum amount of CHF. And remember: This is so complicated because you asked for a solution in XPath 1.0. With XPath 2.0, your problem could be solved with a single path expression.


XML Design

I think that your question illustrates why XML design is actually essential when working with XML. Why? Because your problem boils down to the following: Your XML is designed in a way that makes it difficult to manipulate the content. More precisely, in a td element like this:

<td width="45px">15 CHF</td>

there is an amount (as a number) and a currency, both in the same text node of the td element. If your XML input were designed in a more canonical way, it would look like this:

<td width="45px" currency="CHF">15</td>

See the difference? Now, different kinds of content are clearly separated from each other.
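
The preprocessing step described above could be sketched as follows. This is only an illustration in Python using the standard library's xml.etree.ElementTree; the helper name split_price_cells, the shortened SAMPLE table, and the currency attribute name are assumptions made for the example, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row excerpt of the table from the question.
SAMPLE = """<table class="tablesorter" id="search-result-0"><tbody>
<tr><td width="45px">15 CHF</td><td width="175px">Ausgepack und doch nie gebraucht</td></tr>
<tr><td width="45px">14 CHF</td><td width="175px">Gebraucht, aber noch in Ordnung</td></tr>
</tbody></table>"""


def split_price_cells(table):
    """Move the currency of each row's first <td> into a 'currency' attribute."""
    for row in table.findall("./tbody/tr"):
        cell = row.find("td")  # first td child of this row
        amount, _, currency = (cell.text or "").strip().partition(" ")
        cell.text = amount  # only the number stays in the text node
        if currency:
            cell.set("currency", currency)
    return table


table = split_price_cells(ET.fromstring(SAMPLE))
first_cells = [row.find("td") for row in table.findall("./tbody/tr")]
print(ET.tostring(table, encoding="unicode"))
```

After this step, each first td holds only the number, so a plain numeric comparison on td[1] becomes possible.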


XPath Revised

Assuming that in the newly designed XML, the only content of a tr/td[1] element is the number, the XPath expression by Pavel Minaev that you used can be made to work:

/table[@id="search-result-0"]/tbody/tr[not(td[1] > ../tr/td[1])][1]

XML Result (tested with the tool you use)

<tr>
<td width="45px">14</td>
<td width="175px">Ausgepack und doch nie gebraucht</td>
</tr>

Why does Pavel's expression not work, simply because I add substring-before?

You found part of the answer yourself already. It has to do with how node-sets are handled by XPath 1.0 functions.

substring-before() is an XPath 1.0 function that expects two arguments, both of them strings. Most importantly, if you pass a node-set where a string is expected, it is converted to a string by taking the string value of its first node in document order; all other nodes are ignored.

Pavel's answer, adapted to this question:

tr[not(td[1] > ../tr/td[1])][1]

relies on the fact that the second part of the expression, ../tr/td[1], finds the first td child element of every tr element of tbody. There is no function involved, and there is nothing wrong with a node-set as an operand of >: the comparison is true if it holds for at least one node in the set.
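
To make the comparison rule concrete, here is a minimal sketch in Python that mimics how > behaves against a node-set in XPath 1.0. The list amounts and the helper exists_greater are hypothetical names for this illustration; the numbers are the prices from the question with the currency already stripped.

```python
# Amounts from the question's table, in document order (currency removed).
amounts = [15, 20, 25, 35, 14, 15, 15]


def exists_greater(value, node_set):
    """Mimic XPath 1.0 'value > node-set': true if value exceeds at least one member."""
    return any(value > other for other in node_set)


# tr[not(td[1] > ../tr/td[1])] keeps the rows whose amount exceeds no other amount.
minimal_rows = [
    position
    for position, amount in enumerate(amounts, start=1)
    if not exists_greater(amount, amounts)
]

# The trailing [1] keeps only the first of those rows.
first_minimal_row = minimal_rows[0]
print(minimal_rows, first_minimal_row)
```

Negating an "at least one" comparison turns it into "less than or equal to every value", which is exactly the definition of a minimum.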

If we need substring-before() because the text content is actually both a number (that we want) and a currency (that we'd like to ignore), we have to wrap it around both parts of the expression:

tr[not(substring-before(td[1],' ') > substring-before(../tr/td[1],' '))][1]

No problem on the left side of >, because there is only one td[1] for the current tr. But on the right, there is a whole node-set, namely ../tr/td[1]. Sadly, substring-before() only processes the first node of it.

See the answer by @TML for the consequences of that.

Mathias Müller

The XPath query you're using here would only find the "minimum" if there are no duplicate values and the values are sorted before being written into the nodes. This is because it only compares the current value, substring-before(td[1], " "), to the first value found by substring-before(../tr/td[1], " "). To break down the comparisons:

[1] not(15 > 15)
[2] not(20 > 15)
[3] not(25 > 15)
[4] not(35 > 15)
[5] not(14 > 15)
[6] not(15 > 15)
[7] not(15 > 15)

Comparisons 1, 5, 6, and 7 evaluate to true (the left-hand side is NOT greater than the right-hand side).
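
The breakdown above can be reproduced with a small Python sketch. It only imitates the XPath 1.0 behaviour; the helper substring_before and the list prices are assumptions written for this illustration.

```python
# Text content of each row's first td, in document order.
prices = ["15 CHF", "20 CHF", "25 CHF", "35 CHF", "14 CHF", "15 CHF", "15 CHF"]


def substring_before(text, separator):
    """Mimic XPath substring-before(): the text before the first separator, or ''."""
    head, found, _ = text.partition(separator)
    return head if found else ""


# A node-set passed where a string is expected is reduced to its first node,
# so the right-hand side is always derived from the first row only.
right_hand_side = float(substring_before(prices[0], " "))

selected = [
    position
    for position, price in enumerate(prices, start=1)
    if not (float(substring_before(price, " ")) > right_hand_side)
]
print(selected)
```

The selected positions are the rows whose amount is not greater than the first row's amount, which is why the duplicates of 15 are returned along with 14.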

TML
  • You are right. Calling a function on a node set only returns the result of the first node instead of a set again. Any suggestions on how to solve this? – str Sep 23 '14 at 10:00
  • @str I am tempted to say that this is not possible in XPath 1.0. Can you manipulate the elements beforehand? If `substring-before` can be a separate step that is performed before you apply the XPath expression (so that `15` is left) - then I have a solution for you. – Mathias Müller Sep 23 '14 at 10:18
  • I agree with Mathias. This *actually is* impossible in XPath 1.0 without changing the input XML. – Tomalak Sep 23 '14 at 10:43

In the meantime I decided to use XSLT instead. This is the style sheet that I came up with:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">

    <xsl:output method="text" omit-xml-declaration="yes" indent="no" encoding="UTF-8"/>
    <xsl:strip-space elements="*"/> 

    <xsl:template match="//table[@id='search-result-0']/tbody">
        <ul>
            <xsl:for-each select="tr/td[@width='45px']">
                <xsl:sort select="substring-before(., ' ')" data-type="number" order="ascending"/>

                <xsl:if test="position() = 1">
                     <xsl:value-of select="substring-before(., ' ')"/>
                </xsl:if>
            </xsl:for-each>
        </ul>
    </xsl:template>

    <xsl:template match="text()"/> <!-- ignore the plain text -->

</xsl:stylesheet>
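
For comparison, the same "sort numerically, then take the first" idea could be sketched outside XSLT. The following Python version uses only the standard library; the function name minimum_price and the shortened SAMPLE table are assumptions for this illustration, and it presumes every price cell has the form "<number> <currency>".

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of the table from the question.
SAMPLE = """<table class="tablesorter" id="search-result-0"><tbody>
<tr><td width="45px">15 CHF</td><td width="175px">Ausgepack und doch nie gebraucht</td></tr>
<tr><td width="45px">20 CHF</td><td width="175px">Ausgepack und doch nie gebraucht</td></tr>
<tr><td width="45px">14 CHF</td><td width="175px">Gebraucht, aber noch in Ordnung</td></tr>
<tr><td width="45px">15 CHF</td><td width="175px">Gebraucht, aber noch in Ordnung</td></tr>
</tbody></table>"""


def minimum_price(table_xml):
    """Return the smallest amount found in the td cells with width='45px'."""
    root = ET.fromstring(table_xml)
    cells = root.findall("./tbody/tr/td[@width='45px']")
    # Same role as substring-before(., ' ') with data-type="number".
    amounts = [float(cell.text.split(" ")[0]) for cell in cells]
    # Sort ascending and take the first item, like position() = 1 after xsl:sort.
    return sorted(amounts)[0]


print(minimum_price(SAMPLE))
```
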
str