Xpath on HTML: difference between HTML source and DOM model

Question

I use the jQuery XPath plugin to navigate the hierarchy of an HTML document. I do not use selectors because I need to process information from a server-side framework, which gives me the XPath to a specific element.

I have now observed that the DOM does not necessarily mirror the hierarchy of the HTML source, and I found the explanation here: Why does firebug add <tbody> to <table>?. This means that if my HTML document contains, for example, the following code:

<table>
  <tr>
    <td>Hello</td>
  </tr>
</table>

the DOM will represent it like this:

<table>
  <tbody>
    <tr>
      <td>Hello</td>
    </tr>
  </tbody>
</table>

My XPath query

jQuery(document).xpath('//table[1]/tr[1]/td[1]')

therefore no longer yields a result.

Is there a way of avoiding the synthetic elements of the DOM representation? Or a way to adjust the XPath so that it accounts for the synthetic elements? Thanks for any help.

Either the framework that's giving you the XPaths is buggy, or you have a mismatch between your document being treated as XML/XHTML on the server and as HTML on the browser. — Alohci, Jul 17 '13 at 23:34
Well, the framework is on the server side and manipulates and analyses a document's source code. I try to extend the framework in a way that it uses these path locations on the client side. — Rafael Winterhalter, Jul 18 '13 at 06:08

Rafael Winterhalter · Accepted Answer · 2013-07-18T10:12:27.477

With the help of jQuery I wrote this alternative XPath parser, which works for my use case. The parser follows the path given as input, but if the DOM contains an extra tag in the middle of the path, so that the remainder of the path is wrapped in that single element, the parser recognizes the addition and descends through it. This will not work for every use case, but it works for mine. Maybe this solution is of help to somebody else, at least after some extension:

var SloppyXPathParser = (function () {

    // Move the cursor to the child named element.name at position
    // element.index. If there is no such child but the cursor has exactly
    // one child (for example an implied <tbody>), descend through that
    // child and try again.
    function childExists($cursor, element) {
        assertSelection($cursor);
        var $movedCursor = $cursor.children(element.name);
        if ($movedCursor.length > element.index) {
            return $movedCursor.eq(element.index);
        } else if ($cursor.children().length === 1) {
            return childExists($cursor.children().first(), element);
        } else {
            throw 'Cannot browse to \'' + element.name + '\' at index ' + element.index;
        }
    }

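    // The cursor must be a jQuery object wrapping exactly one node.
    function assertSelection($cursor) {
        if (!($cursor instanceof jQuery) || $cursor.length !== 1) {
            throw 'Selection is invalid: ' + ($cursor instanceof jQuery ? $cursor.length : $cursor);
        }
    }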

    // Split a path such as '/html[0]/body[0]' into { name, index } steps.
    function parsePath(rawPath) {
        // Anchored so that a step must match completely; digits are allowed
        // after the first letter so that tags such as h1 are accepted.
        var regex = /^([a-zA-Z][a-zA-Z0-9]*)\[([0-9]+)\]$/;
        var elements = [];
        jQuery.each(rawPath.split('/'), function (key, element) {
            if (element.length === 0) {
                return true; // skip the empty segment before a leading slash
            }
            var matched = regex.exec(element);
            if (!matched) {
                throw 'Path element does not match regex: ' + element;
            }
            elements.push({ name: matched[1], index: parseInt(matched[2], 10) });
        });
        return elements;
    }

    function findElement(input) {
        var $cursor = jQuery(document);
        try {
            // Walk the path inside the try block so that a failed step is
            // logged and reported as false instead of escaping as an exception.
            jQuery.each(parsePath(input), function (key, element) {
                $cursor = childExists($cursor, element);
            });
            assertSelection($cursor);
        } catch (cause) {
            console.log('Exception: ' + cause);
            return false;
        }
        return $cursor.get(0);
    }

    return {
        find: findElement
    };
})();

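// Unlike XPath, this parser uses zero-based indices:
// tr[1]/td[1] is the second cell of the second row.
var input = '/html[0]/body[0]/table[0]/tr[1]/td[1]';
SloppyXPathParser.find(input);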

with the HTML source being:

<html>
  <body>
    <table>
      <tr>
        <td>wrong</td>
        <td>wrong</td>
      </tr>
      <tr>
        <td>wrong</td>
        <td>right</td>
      </tr>
    </table>
  </body>
</html>

You can check with a tool such as Firebug that the browser adds a tbody element to the DOM. The parser recognizes this and descends through the extra element.
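
The descent rule can also be tried without a browser. The following is only a sketch under stated assumptions: plain objects of the form { name, children, text } stand in for DOM elements, and node, sloppyChild and sloppyFind are hypothetical names that are not part of jQuery or of the parser above.

```javascript
// Hypothetical stand-in for a DOM element (assumption: not a real DOM node).
function node(name, children, text) {
    return { name: name, children: children || [], text: text || '' };
}

// Find the child called `name` at zero-based `index`. If it is missing and
// the parent has exactly one child (e.g. an implied tbody), look inside it.
function sloppyChild(parent, name, index) {
    var matches = parent.children.filter(function (child) {
        return child.name === name;
    });
    if (index < matches.length) {
        return matches[index];
    }
    if (parent.children.length === 1) {
        return sloppyChild(parent.children[0], name, index);
    }
    throw new Error("Cannot browse to '" + name + "' at index " + index);
}

// Walk a path such as '/html[0]/body[0]' step by step from `root`.
function sloppyFind(root, path) {
    return path.split('/').filter(Boolean).reduce(function (cursor, step) {
        var matched = /^([a-zA-Z][a-zA-Z0-9]*)\[([0-9]+)\]$/.exec(step);
        if (!matched) {
            throw new Error('Path element does not match regex: ' + step);
        }
        return sloppyChild(cursor, matched[1], parseInt(matched[2], 10));
    }, root);
}

// The tree as a browser would build it, including the added tbody.
var dom = node('#document', [
    node('html', [
        node('body', [
            node('table', [
                node('tbody', [
                    node('tr', [node('td', [], 'wrong'), node('td', [], 'wrong')]),
                    node('tr', [node('td', [], 'wrong'), node('td', [], 'right')])
                ])
            ])
        ])
    ])
]);

console.log(sloppyFind(dom, '/html[0]/body[0]/table[0]/tr[1]/td[1]').text); // prints "right"
```

Like the jQuery version, this only skips a wrapper when it is the single child of its parent, so it inherits the same limitations.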

score 0 · Answer 2 · edited May 23 '17 at 12:05

If you haven't got any nested tables, jQuery(document).xpath('//table[1]//tr[1]/td[1]') should work in both cases.

In the more general case, you can adapt the answer to How do you select child-or-self (children + self).

In XPath 1.0, this translates to jQuery(document).xpath('(//table|//table/tbody)/tr[1]/td[1]') or, even more generally, to jQuery(document).xpath('(//table|//table/node())/tr[1]/td[1]').
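
If the paths come from a server-side framework, the same idea could be applied mechanically before the query is run. The sketch below is an assumption-laden illustration, not part of the jQuery XPath plugin: withOptionalTbody is a hypothetical helper, it only handles simple paths whose predicates contain no slash, and it produces a union of plain paths rather than the parenthesized form above.

```javascript
// Hypothetical helper: for every tr step that directly follows a table step,
// emit one variant without and one variant with an intermediate tbody step,
// and join all variants into an XPath 1.0 union.
function withOptionalTbody(xpath) {
    var steps = xpath.split('/');
    var variants = [[]];
    steps.forEach(function (step, i) {
        var isRowAfterTable = i > 0 &&
            /^tr(\[|$)/.test(step) &&
            /^table(\[|$)/.test(steps[i - 1]);
        var next = [];
        variants.forEach(function (variant) {
            next.push(variant.concat([step]));
            if (isRowAfterTable) {
                next.push(variant.concat(['tbody', step]));
            }
        });
        variants = next;
    });
    return variants.map(function (variant) {
        return variant.join('/');
    }).join(' | ');
}

console.log(withOptionalTbody('//table[1]/tr[1]/td[1]'));
// prints "//table[1]/tr[1]/td[1] | //table[1]/tbody/tr[1]/td[1]"
```

Paths that contain no table/tr pair are returned unchanged. With nested tables the number of variants doubles for each such pair.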

answered Jul 17 '13 at 15:27

paul trmbrth

This would work for this one particular case, but not for general input. – Rafael Winterhalter Jul 18 '13 at 06:01
Edited with solution adapted from http://stackoverflow.com/questions/4311470/how-do-you-select-child-or-self-children-self – paul trmbrth Jul 18 '13 at 07:34

score 0 · Answer 3 · answered Jul 18 '13 at 09:11

Simply turn the single / in front of your tr into a double slash:

//table[1]/tr[1]/td[1] -> //table[1]//tr[1]/td[1]

This will match a table row at any depth below the initial table tag, so you can add as many <tbody> tags as you like.
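
One caveat of the descendant step is that it also reaches rows of nested tables. The sketch below imitates what //tr[1] selects, using plain objects instead of a real DOM. The names el and firstChildrenNamed are hypothetical, and the traversal is a hand-written approximation, not an XPath engine.

```javascript
// Hypothetical stand-in for a DOM element.
function el(name, children) {
    return { name: name, children: children || [] };
}

// Collect, in document order, every descendant called `name` that is the
// first child with that name under its own parent (the effect of //name[1]).
function firstChildrenNamed(root, name) {
    var found = [];
    (function visit(parent) {
        var seen = false;
        parent.children.forEach(function (child) {
            if (child.name === name && !seen) {
                found.push(child);
                seen = true;
            }
            visit(child);
        });
    })(root);
    return found;
}

// An outer table whose only cell contains a nested table.
var innerRow = el('tr', [el('td')]);
var outerRow = el('tr', [el('td', [el('table', [el('tbody', [innerRow])])])]);
var outerTable = el('table', [el('tbody', [outerRow])]);

var rows = firstChildrenNamed(outerTable, 'tr');
console.log(rows.length); // prints 2: the outer row and the nested row
```

With a nested table the query therefore returns more than one row, so the result has to be narrowed further if only the outer row is wanted.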

Tro

Well, what if the document contains a nested table? This will add ambiguity. – Rafael Winterhalter Jul 18 '13 at 09:12
Assuming you were no longer going to only match the first `<tr>` and `<td>` then yes. You could consider adding a class attribute to help determine which level any `<table>` tags are on, but I guess that would be a bit of a dirty hack. – Tro Jul 18 '13 at 09:19
