How to select only leaf nodes with Nokogiri?

Question

I am looking for some advice on how this could be done. I'm trying to find a solution that uses only XPath:

An HTML example:

<div>
  <div>
    <div>text div (leaf)</div>
    <p>text paragraph (leaf)</p>
  </div>
</div>
<p>text paragraph 2 (leaf)</p>

Code:

doc = Nokogiri::HTML.fragment("- the html above -")
result = doc.xpath("*[not(child::*)]")


[#<Nokogiri::XML::Element:0x3febf50f9328 name="p" children=[#<Nokogiri::XML::Text:0x3febf519b718 "text paragraph 2 (leaf)">]>]

But this XPath only gives me the last "p". What I want is a flatten-like behavior that returns only the leaf nodes.

Here are some related answers on Stack Overflow:

How to select all leaf nodes using XPath expression?

XPath - Get node with no child of specific type

Thanks

@Luccas: Do you want just the text, or do you want the containing element as well? i.e. do you want `text paragraph (leaf)` or `<p>text paragraph (leaf)</p>`? And if you want just the text, do you want all the text nodes separately, or do you simply want all the text concatenated as a single string? — Borodin, Jul 26 '13 at 22:55
Your attempt failed because you used `xpath('*…')` instead of `xpath('.//*…')`; see [this bug report](https://github.com/sparklemotion/nokogiri/issues/213) and [this one](https://github.com/sparklemotion/nokogiri/issues/572). — Phrogz, Jul 27 '13 at 14:03

score 7 · Answer 1 · edited Jul 26 '13 at 21:22

You can find all element nodes that have no child elements using:

//*[not(*)]

Example:

require 'nokogiri'

doc = Nokogiri::HTML.parse <<-HTML
<div>
  <div>
    <div>text div (leaf)</div>
    <p>text paragraph (leaf)</p>
  </div>
</div>
<p>text paragraph 2 (leaf)</p>
HTML

puts doc.xpath('//*[not(*)]').length
#=> 3

doc.xpath('//*[not(*)]').each do |e|
  puts e.text
end
#=> text div (leaf)
#=> text paragraph (leaf)
#=> text paragraph 2 (leaf)

score 3 · Accepted Answer · answered Jul 26 '13 at 20:16

The problem with your code is the statement:

doc = Nokogiri::HTML.fragment("- the html above -")

See here:

require 'nokogiri'

html = <<END_OF_HTML
<div>
  <div>
    <div>text div (leaf)</div>
    <p>text paragraph (leaf)</p>
  </div>
</div>
<p>text paragraph 2 (leaf)</p>
END_OF_HTML


doc = Nokogiri::HTML(html)
#doc = Nokogiri::HTML.fragment(html)
results = doc.xpath("//*[not(child::*)]")
results.each {|result| puts result}

--output:--
<div>text div (leaf)</div>
<p>text paragraph (leaf)</p>
<p>text paragraph 2 (leaf)</p>

If I run this:

doc = Nokogiri::HTML.fragment(html)
results = doc.xpath("//*[not(child::*)]")
results.each {|result| puts result}

I get no output.

See https://github.com/sparklemotion/nokogiri/issues/213 and https://github.com/sparklemotion/nokogiri/issues/572 — Phrogz, Jul 27 '13 at 14:07

score 2 · Answer 3 · answered Jul 26 '13 at 21:38

In XPath, the text itself is a node. Given your comment, you would therefore want to select only the tag contents, not the tags containing them, though a true leaf-node selection would also capture a <br/> (if there was one).

I guess you're looking for all elements that do not contain other elements (tags), which is not exactly what you asked for. In that case you're fine with @Justin Ko's answer; use the XPath expression

//*[not(*)]

If you really want to look for all leaf nodes, you cannot use the * selector, but need to use node():

//node()[not(node())]

Nodes can be elements, but also text nodes, comments, processing instructions, attributes and even XML documents (but those cannot occur within other elements).

If you really want only the text nodes, go for //text() as @Priti proposed. It selects exactly the nodes you marked as leaves in your example, even though that is not how leaf nodes are defined.
