
I experimented with different XPath queries in XPather (which works only with older Firefox versions) and noticed a difference between the results of the following two queries.

This one shows some results

//div[descendant::table/descendant::td[4]] 

This one returns an empty list

//div[//table//td[4]]

Do they differ because of the XPath rules, or is this just misbehavior of a particular XPath implementation? (XPather seems to use the Firefox engine; it is just an excellent, simple GUI for querying.)

Maksee

2 Answers


With XPath 1.0 // is an abbreviation for /descendant-or-self::node()/ so your first path is /descendant-or-self::node()/div[descendant::table/descendant::td[4]] while the second is rather different with /descendant-or-self::node()/div[/descendant-or-self::node()/table/descendant-or-self::node()/td[4]]. So the major difference is that inside your first predicate you look down for descendants relative to the div element while in the second predicate you look down for descendants from the root node / (also called the document node). You might want //div[.//table//td[4]] for the second path expression to come closer to the first one.

[edit] Here is a sample:

<html>
  <body>
    <div>
      <table>
        <tbody>
          <tr>
            <td>1</td>
          </tr>
          <tr>
            <td>2</td>
          </tr>
          <tr>
            <td>3</td>
          </tr>
          <tr>
            <td>4</td>
          </tr>
        </tbody>
      </table>
    </div>
  </body>
</html>

With that sample, the path //div[descendant::table/descendant::td[4]] selects the div element because it has a table descendant that in turn has a fourth td descendant.

However with //div[.//table//td[4]] we look for //div[./descendant-or-self::node()/table/descendant-or-self::node()/td[4]] which is short for //div[./descendant-or-self::node()/table/descendant-or-self::node()/child::td[4]] and there is no element having a fourth td child element.

I hope that explains the difference. If you use //div[.//table/descendant::td[4]], you should get the same result as with your original form.
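The difference on the sample can also be checked outside a browser. What follows is only a minimal sketch: Python's standard xml.etree.ElementTree has no descendant:: axis, so the two helper functions (nth_descendant and nth_child_anywhere, both names made up for this illustration) emulate the two steps by hand rather than running real XPath.

```python
import xml.etree.ElementTree as ET

# Same shape as the sample above: one table, four rows, one cell per row.
SAMPLE = """
<html><body><div><table><tbody>
  <tr><td>1</td></tr>
  <tr><td>2</td></tr>
  <tr><td>3</td></tr>
  <tr><td>4</td></tr>
</tbody></table></div></body></html>
"""


def nth_descendant(elem, tag, n):
    """Emulate descendant::tag[n]: n-th matching descendant in document order."""
    found = [e for e in elem.iter(tag) if e is not elem]
    return found[n - 1] if len(found) >= n else None


def nth_child_anywhere(elem, tag, n):
    """Emulate .//tag[n], i.e. descendant-or-self::node()/child::tag[n]."""
    hits = []
    for parent in elem.iter():  # elem itself, then every descendant
        same_tag = [child for child in parent if child.tag == tag]
        if len(same_tag) >= n:
            hits.append(same_tag[n - 1])
    return hits


root = ET.fromstring(SAMPLE)
table = root.find(".//table")

fourth_descendant = nth_descendant(table, "td", 4)
fourth_children = nth_child_anywhere(table, "td", 4)

print(fourth_descendant.text)  # 4
print(fourth_children)         # []
```

The table has four td descendants but no element with four td children, which is why only the descendant:: form finds anything.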

Martin Honnen
  • Thanks, but your fixed version doesn't work either. I checked this page's URL with XPather/FF3.6.28 (available on PortableApps). The full syntax returns 9 results, while the short one with the dot returns 0 results. Is there any other site or program with a different engine to check against? – Maksee Apr 07 '12 at 11:35
  • Consider posting a sample of the XML or HTML you are querying with XPath. I think I have given the right explanation of what `//` stands for; also see http://www.w3.org/TR/xpath/#path-abbrev. The long syntax with `descendant-or-self::node()` is difficult to read and write, so I managed to make two mistakes when writing the path samples; I hope I have corrected them now. There is still a difference between `[descendant::table/descendant::td[4]]` and `[.//table//td[4]]`, and it is best to talk about that with sample data. – Martin Honnen Apr 07 '12 at 11:42
  • The sample data is this page (sorry if I wasn't clear), that is, the contents of the very page we're having this discussion on :) – Maksee Apr 07 '12 at 11:52
  • If you want to use abbreviated syntax as much as possible then doing `//div[.//table/descendant::td[4]]` is as far as you can go I think to get the same result as with `//div[descendant::table/descendant::td[4]]`. I will edit my post to show a sample illustrating the difference. – Martin Honnen Apr 07 '12 at 12:31
  • I think I now understand. position() comes from enumerating the axis, and for descendant that means all td elements below, not only children, right? – Maksee Apr 07 '12 at 13:10
  • The positional predicate `[4]` applies to the nodes selected in the step it is appended to, and in `//div[descendant::table/descendant::td[4]]` that step selects descendant `td` elements of the `table`, of which there are four. In the sample `//div[.//table//td[4]]`, the step to which the predicate applies after expansion of the abbreviated syntax is `child::td[4]`. So there are differences between doing `descendant::foo[n]` and `//foo[n]`. If you prefer the abbreviated syntax, then for the sample `//div[(.//table//td)[4]]` should work too. – Martin Honnen Apr 07 '12 at 16:12
  • is this outlined anywhere in their docs? feel like they dont cover the actual commands? like xpath or .css() – filthy_wizard Nov 12 '19 at 16:47

There's an important note in W3C document on XPath 1.0 (W3C Recommendation 16 November 1999):

XML Path Language (XPath) Version 1.0
    2 Location Paths
        2.5 Abbreviated Syntax

NOTE: The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents.

A similar note appears in the document on XPath 3.1 (W3C Recommendation 21 March 2017):

XML Path Language (XPath) 3.1
    3 Expressions
        3.3 Path Expressions
            3.3.5 Abbreviated Syntax

NOTE: The path expression //para[1] does not mean the same as the path expression /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their respective parents.

That means the step to the right of a double slash is not evaluated once over the whole set of descendants. Since // stands for /descendant-or-self::node()/, the following step (on the child axis by default) is re-run for the current context node and each of its descendants, and a positional predicate such as [1] counts among the children of each of those nodes separately.
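The W3C note can be reproduced with a tiny document. This is a sketch using Python's standard xml.etree.ElementTree, which supports only a subset of XPath: its `.//para[1]` form follows the same "first para child of its parent" rule, while `iter()` stands in for the descendant axis, which ElementTree does not offer. The document and the variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

DOC = "<doc><sec><para>a</para><para>b</para></sec><sec><para>c</para></sec></doc>"
root = ET.fromstring(DOC)

# //para[1]: every para that is the first para child of its parent.
first_children = [p.text for p in root.findall(".//para[1]")]

# /descendant::para[1]: the single first para in document order.
# ElementTree has no descendant:: axis, so iter() stands in for it.
first_descendant = next(root.iter("para")).text

print(first_children)    # ['a', 'c']
print(first_descendant)  # a
```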

So the exact meaning of the predicate in this path

//div[ descendant::table/descendant::td[4] ]

is:

  • build a sequence of all <table> elements that are descendants of the current <div>,
  • for every such <table>, build a sequence of all its descendant <td> elements and keep the fourth item of that sequence, if there is one,
  • combine whatever is left; the predicate is true if the result is not empty.

Finally, the path returns all <div> elements in the document that contain at least one table with four or more data cells (counting cells in tables nested inside it, of course). Since the document has such tables, the whole expression selects their respective <div> ancestors.
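One detail is easy to get wrong here: the positional predicate [4] binds to the descendant::td step, so under XPath 1.0 rules it is counted inside each table separately, not over the cells of all tables taken together (that would be the parenthesized form (descendant::table/descendant::td)[4]). The sketch below emulates both readings by hand in Python; the document, the id values and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one div with two 2-cell tables, one div with one 4-cell table.
DOC = """
<body>
  <div id="split">
    <table><tr><td>a</td><td>b</td></tr></table>
    <table><tr><td>c</td><td>d</td></tr></table>
  </div>
  <div id="single">
    <table><tr><td>e</td><td>f</td></tr><tr><td>g</td><td>h</td></tr></table>
  </div>
</body>
"""


def descendants(elem, tag):
    """All descendants of elem with the given tag, in document order."""
    return [e for e in elem.iter(tag) if e is not elem]


def per_table(div, n):
    """descendant::table/descendant::td[n]: [n] is counted inside each table."""
    return any(len(descendants(t, "td")) >= n for t in descendants(div, "table"))


def across_tables(div, n):
    """(descendant::table/descendant::td)[n]: [n] is counted over the merged set."""
    cells = {id(td) for t in descendants(div, "table") for td in descendants(t, "td")}
    return len(cells) >= n


root = ET.fromstring(DOC)
per = [d.get("id") for d in root.iter("div") if per_table(d, 4)]
across = [d.get("id") for d in root.iter("div") if across_tables(d, 4)]

print(per)     # ['single']
print(across)  # ['split', 'single']
```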

On the other hand the predicate in

//div[ //table//td[4] ]

means:

  • scan the whole document tree for <table> elements (more precisely, test the root node and every root's descendant if it has a <table> child),
  • for every table found scan its subtree for elements having a fourth <td> subelement (i.e. test if the table or any of its descendants has at least four <td> children).

Please note the predicate subexpression does not depend on the context node. It is a global path, resolving to some sequence of nodes (possibly empty), thus the predicate boolean value depends only on the document's structure. If it is true the whole path returns a sequence of all <div> elements in the document, else the empty sequence.

Finally, the predicate would be true iff some element inside any table had at least 4 <td> children.
And as far as I can see, all <tr> rows contain two or three cells. There is no element with 4 or more <td> children, so the predicate subexpression returns an empty sequence, the predicate is false, and the whole path gets filtered out. The result is nothing (an empty sequence).
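The all-or-nothing behaviour of the global predicate can be sketched in Python as well. ElementTree cannot evaluate an absolute path inside a predicate, so the helpers below (global_predicate and select_divs, hypothetical names) emulate //div[//table//td[n]] by hand on a made-up page whose rows never have more than three cells.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: rows never have more than three cells.
DOC = """
<body>
  <div id="with-table">
    <table>
      <tr><td>1</td><td>2</td><td>3</td></tr>
      <tr><td>4</td><td>5</td></tr>
    </table>
  </div>
  <div id="no-table"><p>text</p></div>
</body>
"""


def global_predicate(root, n):
    """//table//td[n]: true if anything inside any table has an n-th td child."""
    return any(
        len([c for c in parent if c.tag == "td"]) >= n
        for table in root.iter("table")
        for parent in table.iter()  # the table itself, then its descendants
    )


def select_divs(root, n):
    """//div[//table//td[n]]: the predicate ignores the div being tested."""
    return [d.get("id") for d in root.iter("div") if global_predicate(root, n)]


root = ET.fromstring(DOC)
with_four = select_divs(root, 4)
with_three = select_divs(root, 3)

print(with_four)   # []
print(with_three)  # ['with-table', 'no-table']
```

With n = 3 even the div that contains no table at all is selected, because the predicate never looks at the div being tested.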

CiaPan