Is there a library that can give me the XPATH for all the nodes in an HTML page?
-
Which language are you using? – samplebias Apr 13 '11 at 01:06
-
//node() is the XPath for all of the nodes. – Steven D. Majewski Apr 13 '11 at 01:40
-
Good question, +1. See my answer for an exhaustive solution. :) – Dimitre Novatchev Apr 13 '11 at 04:25
-
@samplebias: Java would be better, but I don't mind even if it's PHP or Perl. – user583726 Apr 13 '11 at 18:33
-
@Steven D. Majewski: No. It isn't. – Apr 14 '11 at 19:00
2 Answers
is there any library that can give me XPATH for all the nodes in HTML page
Yes, if this HTML page is a well-formed XML document.
Depending on what you understand by "node"...
//*
selects all the elements in the document.
/descendant-or-self::node()
selects all elements, text nodes, processing instructions, comment nodes, and the root node (/).
//text()
selects all text nodes in the document.
//comment()
selects all comment nodes in the document.
//processing-instruction()
selects all processing instructions in the document.
//@*
selects all attribute nodes in the document.
//namespace::*
selects all namespace nodes in the document.
Finally, you can combine any of the above expressions using the union (|) operator.
Thus, I believe that the following expression really selects "all the nodes" of any XML document:
/descendant-or-self::node() | //@* | //namespace::*
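
To make the node categories above concrete, here is a minimal sketch that walks a small document with Python's standard-library DOM and counts what each expression would cover. It is not an XPath engine: the `SAMPLE` document and the helper names `descendant_or_self` and `all_nodes` are made up for illustration, and namespace nodes are left out because the DOM has no counterpart for them.

```python
from collections import Counter
from xml.dom import Node, minidom

# Hypothetical sample document, invented for this illustration.
SAMPLE = '<root a="1"><!-- note --><?target data?><child b="2">text</child></root>'


def descendant_or_self(node):
    """Yield the node and then all of its descendants, in document order."""
    yield node
    for child in node.childNodes:
        yield from descendant_or_self(child)


def all_nodes(document):
    """Approximate /descendant-or-self::node() | //@* with a DOM walk."""
    result = []
    for node in descendant_or_self(document):
        result.append(node)
        if node.nodeType == Node.ELEMENT_NODE:
            # Attributes are not children of their element, so add them explicitly.
            attributes = node.attributes
            result.extend(attributes.item(i) for i in range(attributes.length))
    return result


document = minidom.parseString(SAMPLE)
counts = Counter(node.nodeType for node in all_nodes(document))

# //node() expands to /descendant-or-self::node()/child::node(): it keeps only
# nodes that are some node's child, so the document root is left out.
children_only = [child
                 for node in descendant_or_self(document)
                 for child in node.childNodes]

print("elements (//*):", counts[Node.ELEMENT_NODE])
print("attributes (//@*):", counts[Node.ATTRIBUTE_NODE])
print("text nodes (//text()):", counts[Node.TEXT_NODE])
print("comments (//comment()):", counts[Node.COMMENT_NODE])
print("processing instructions:", counts[Node.PROCESSING_INSTRUCTION_NODE])
print("root node (/):", counts[Node.DOCUMENT_NODE])
print("nodes reached by //node():", len(children_only))
```

The walk also shows why the union is needed: attributes never appear among a node's children, so the descendant axis alone cannot reach them.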

-
`//node()` does not select the root because it expands to `/descendant-or-self::node()/child::node()`. In fact, the `node()` pattern doesn't match the document root. – Apr 14 '11 at 19:13
-
@Alejandro: Good catch, fixed. As for selecting the document root, it still matches `node()` as in `ancestor::node()` or `self::node()` – Dimitre Novatchev Apr 14 '11 at 22:34
In case this is helpful for someone else: if you're using Python with lxml, you first need to parse the page into a tree, and then query that tree with the XPath expressions that Dimitre lists above.
To get the tree:
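from lxml import html, etree

# Deliberately malformed HTML: unclosed tags and a mismatched </h3>.
your_webpage_string = "<html><head><title>test<body><h1>page title</h3>"
# lxml's HTML parser repairs the markup while building the tree.
root = html.fromstring(your_webpage_string)
# Serialize the repaired tree and re-parse it as XML.
# (Optional: root.xpath(...) can also be called on the HTML tree directly.)
good_html = etree.tostring(root, pretty_print=True).strip()
your_tree = etree.fromstring(good_html)
all_xpaths = your_tree.xpath('//*')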
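On the last line, replace '//*' with whatever XPath expression you want. Despite its name, all_xpaths holds the matching Element objects rather than XPath strings; your_tree.getroottree().getpath(element) returns the path of a single element. The list looks like this: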
[<Element html at 0x7ff740b24b90>,
<Element head at 0x7ff740b24d88>,
<Element title at 0x7ff740b24dd0>,
<Element body at 0x7ff740b24e18>,
<Element h1 at 0x7ff740b24e60>]
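
If what you want is an XPath string for every element rather than the Element objects themselves, the paths can be built while walking the tree. The following is a rough sketch using only the standard library's `xml.etree.ElementTree`, so it assumes the markup is already well-formed and uses no namespaces. The function name `element_xpaths` and the `WELL_FORMED` sample (a repaired variant of the markup above with two extra paragraphs) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, already well-formed variant of the page used above.
WELL_FORMED = (
    "<html><head><title>test</title></head>"
    "<body><h1>page title</h1><p>a</p><p>b</p></body></html>"
)


def element_xpaths(root):
    """Return an absolute, positional XPath for every element under root."""
    paths = []

    def visit(element, path):
        paths.append(path)
        # Count same-tag siblings so that a position predicate such as [2]
        # is only added where a bare tag name would be ambiguous.
        totals = Counter(child.tag for child in element)
        seen = Counter()
        for child in element:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:
                step += "[%d]" % seen[child.tag]
            visit(child, path + "/" + step)

    visit(root, "/" + root.tag)
    return paths


paths = element_xpaths(ET.fromstring(WELL_FORMED))
for path in paths:
    print(path)
```

Only elements are covered here; text, comment, and attribute nodes would each need their own step syntax (`text()`, `comment()`, `@name`).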