32

Is there a library that can give me the XPATH for all the nodes in an HTML page?

rath3r
  • 323
  • 1
  • 6
  • 19
user583726
  • 647
  • 2
  • 9
  • 20

2 Answers2

52

is there any library that can give me XPATH for all the nodes in HTML page

Yes, if this HTML page is a well-formed XML document.

Depending on what you understand by "node"...

//*

selects all the elements in the document.

/descendant-or-self::node()

selects all elements, text nodes, processing instructions, comment nodes, and the root node /.

//text()

selects all text nodes in the document.

//comment()

selects all comment nodes in the document.

//processing-instruction()

selects all processing instructions in the document.

//@* 

selects all attribute nodes in the document.

//namespace::*

selects all namespace nodes in the document.

Finally, you can combine any of the above expressions using the union (|) operator.

Thus, I believe that the following expression really selects "all the nodes" of any XML document:

/descendant-or-self::node() | //@* | //namespace::*
Dimitre Novatchev
  • 240,661
  • 26
  • 293
  • 431
  • 2
    `//node()` does not select the root because it's expanded to `/descendant-or-self::node()/child::node()`. In fact `node()` pattern doesn't match the document root. –  Apr 14 '11 at 19:13
  • @Alejandro: Good catch, fixed. As for selecting the document root, it still matches `node()` as in `ancestor::node()` or `self::node()` – Dimitre Novatchev Apr 14 '11 at 22:34
  • Sorry, I should said _"`node()` as a pattern"_. –  Apr 14 '11 at 22:41
  • 1
    Just found it helpful for Delphi – user30478 May 27 '18 at 06:40
2

In case this is helpful for someone else, if you're using python/lxml, you'll first need to have a tree, and then query that tree with the XPATH paths that Dimitre lists above.

To get the tree:

import lxml
from lxml import html, etree

your_webpage_string = "<html><head><title>test<body><h1>page title</h3>"
root = lxml.html.fromstring(your_webpage_string)
good_html = etree.tostring(root, pretty_print=True).strip()
your_tree = etree.fromstring(good_html)
all_xpaths = your_tree.xpath('//*') 

On the last line, replace '//*' with whatever xpath you want. all_xpaths is now a list which looks like this:

[<Element html at 0x7ff740b24b90>,
 <Element head at 0x7ff740b24d88>,
 <Element title at 0x7ff740b24dd0>,
 <Element body at 0x7ff740b24e18>,
 <Element h1 at 0x7ff740b24e60>]
Aeryes
  • 629
  • 3
  • 10
  • 24
tegan
  • 2,125
  • 2
  • 14
  • 17