17

I have the following example of HTML:

<!-- lots of html -->
<h2>Foo bar</h2>
<p>lorem</p>
<p>ipsum</p>
<p>etc</p>

<h2>Bar baz</h2>
<p>dum dum dum</p>
<p>poopfiddles</p>
<!-- lots more html ... -->

I'm looking to extract all paragraphs following the 'Foo bar' header, until I reach the 'Bar baz' header (the text of the 'Bar baz' header is unknown, so unfortunately I can't use the answer provided by bougyman). I can of course use something like //h2[text()='Foo bar']/following::p, but that will grab all paragraphs following this header. So I have the option to traverse the node set and push paragraphs into an Array until I reach the next header, but let's be honest, that's never as cool as being able to do it in XPath.

Is there a way to do this that I'm missing?

Lee Jarvis
  • Good question, +1. See my answer for a single XPath expression that selects all "immediate following siblings" of the specified node. I also provide a more general XPath expression that can be used to find the "immediate following siblings" of any node. Extensive explanation is provided. – Dimitre Novatchev Jan 22 '11 at 15:41

7 Answers

20

Use:

(//h2[. = 'Foo bar'])[1]/following-sibling::p
   [1 = count(preceding-sibling::h2[1] | (//h2[. = 'Foo bar'])[1])]

In case it is guaranteed that every h2 has a distinct value, this may be simplified to:

//h2[. = 'Foo bar']/following-sibling::p
   [1 = count(preceding-sibling::h2[1] | ../h2[. = 'Foo bar'])]

This means: Select all p elements that are following siblings of the h2 (the first or only one in the document) whose string value is 'Foo bar', and whose first preceding sibling h2 is exactly that same h2.

Here we use a method of finding whether two nodes are identical:

count($n1 | $n2) = 1

is true() exactly when the nodes $n1 and $n2 are the same node.

This expression can be generalized:

$x/following-sibling::p
       [1 = count(preceding-sibling::node()[name() = name($x)][1] | $x)]

selects all "immediate following siblings" of any node specified by $x.

Robin
Dimitre Novatchev
  • 9
    *sigh* Why do I even bother to answer xpath questions with you around? I had hoped you were asleep ;) Mine is conceptually simpler (for me), but I'm sure yours is more performant. +1 – Phrogz Jan 22 '11 at 15:41
  • 6
    @phrogz: I am really sorry I woke up at 6 am on a Saturday morning and I had nothing better to do :) – Dimitre Novatchev Jan 22 '11 at 15:48
  • @Dimitre It's alright, my children woke me up at 7, so I take solace in the fact that I got one more hour than you. :D – Phrogz Jan 22 '11 at 15:49
  • @phrogz: As for comparison of the efficiency of our answers, I think that you are generally right that mine may be more efficient -- however this all depends on the optimizer used by the specific XPath implementation that is used. – Dimitre Novatchev Jan 22 '11 at 15:50
  • @phrogz: For good or for bad my daughter is now a freshman at the uni and usually sleeps much longer than I :) – Dimitre Novatchev Jan 22 '11 at 15:52
  • Very nice, I tried to write something like this but came across a couple of flaws your answer avoids. Major upvotes for both yourself and Phrogz! Thank you – Lee Jarvis Jan 22 '11 at 15:57
  • @Dimitre: +1 Good answer. A minor point: in this case, because the "mark" is checked against siblings, you could use `../h2[. = 'Foo bar']` instead of the absolute `//h2[. = 'Foo bar']` expression; it's also good for clarity. There is also a typo in the last `preceding-sibling::node`, which is missing `()`. –  Jan 22 '11 at 16:17
  • @Alejandro: Thank you for noticing this. Also, your suggestion of using `../h2` instead of `//h2` is a significant performance improvement. Fixed now. – Dimitre Novatchev Jan 22 '11 at 16:26
3

This XPath 1.0 statement selects all of the <p> elements that are siblings following an <h2> whose string value is equal to "Foo bar", and that are also followed by an <h2> sibling element whose first preceding sibling <h2> has a string value of "Foo bar".

//p[preceding-sibling::h2[.='Foo bar']]
 [following-sibling::h2[
  preceding-sibling::h2[1][.='Foo bar']]]
Mads Hansen
  • @Mads-Hansen: Your XPath expression does not select what you're saying it does. Your statement will become true if you replace the string "text()" with "string value", or if you modify the expression itself and replace '.' with 'text()' -- which I don't recommend. – Dimitre Novatchev Jan 22 '11 at 15:46
  • Yes, although I didn't think that HTML heading elements had mixed content. For the purpose of this example, the string value of an `<h2>` that only has a text node is the same as `text()`. – Mads Hansen Jan 22 '11 at 16:05
3

In XPath 2.0 (I know this doesn't help you...) the simplest solution is probably

h2[. = 'Foo bar']/following-sibling::* except h2[. = 'Bar baz']/(.|following-sibling::* )

But like other solutions presented, this is likely (in the absence of an optimizer that recognizes the pattern) to be linear in the number of elements beyond the second h2, whereas you would really like a solution whose performance depends only on the number of elements selected. I've always felt it would be nice to have an until operator:

h2[. = 'Foo bar']/(following-sibling::* until . = 'Bar baz')

In its absence an XSLT or XQuery solution using recursion is likely to perform better when the number of nodes to be selected is small compared with the number of following siblings.

Michael Kay
3

Just because it's not among the answers yet, here is the classic XPath 1.0 set exclusion:

A - B = $A[count(.|$B)!=count($B)]

For this case:

(//h2[.='Foo bar']
    /following-sibling::p)
       [count(.|../h2[.='Foo bar']
                     /following-sibling::h2[1]
                        /following-sibling::p)
        != count(../h2[.='Foo bar']
                     /following-sibling::h2[1]
                        /following-sibling::p)]

Note: This is the negation of the Kaysian method for node-set intersection.

2

XPath 2.0 has the operator << (with $node1 << $node2 being true if $node1 precedes $node2), so you can use //h2[. = 'Foo bar']/following-sibling::p[. << //h2[. = 'Bar baz']]. However, I don't know what Nokogiri is, or whether it supports XPath 2.0.

Martin Honnen
  • Unfortunately it doesn't, that looks very cool though. Thanks for the response nonetheless, have an upvote. – Lee Jarvis Jan 22 '11 at 11:50
2

How about matching on the second one? If you only want the top section, match the second header and grab everything above it:

doc.xpath("//h2[text()='Bar baz']/preceding-sibling::p").map { |m| m.text }
#=> ["lorem", "ipsum", "etc"]

Or, if you don't know the second one, go another level with:

doc.xpath("//h2[text()='Foo bar']/following-sibling::h2/preceding-sibling::p").map { |it| it.text }
#=> ["lorem", "ipsum", "etc"]

Mads Hansen
bougyman
  • Unfortunately I can't use the second header text as a selector because it's not unique, and the text could be anything, so I **have** to use the first header. – Lee Jarvis Jan 22 '11 at 12:35
  • I think the second suggestion should work well enough. Thanks! – Lee Jarvis Jan 22 '11 at 12:51
  • Ah my bad, terrible example.. there will also be paragraphs before the first header, meaning your second example will also grab these :( – Lee Jarvis Jan 22 '11 at 13:03
2
require 'nokogiri'

doc = Nokogiri::XML <<ENDXML
<root>
  <h2>Foo</h2>
  <p>lorem</p>
  <p>ipsum</p>
  <p>etc</p>

  <h2>Bar</h2>
  <p>dum dum dum</p>
  <p>poopfiddles</p>
</root>
ENDXML

a = doc.xpath( '//h2[text()="Foo"]/following::p[not(preceding::h2[text()="Bar"])]' )
puts a.map{ |n| n.to_s }
#=> <p>lorem</p>
#=> <p>ipsum</p>
#=> <p>etc</p>

I suspected that it might be more efficient to just walk the DOM using next_sibling until you hit the end:

# Header texts match the sample document above ("Foo" and "Bar").
node = doc.at_xpath('//h2[text()="Foo"]').next_sibling
stop = doc.at_xpath('//h2[text()="Bar"]')
a = []
while node && node != stop
  a << node unless node.text? # skip the whitespace text nodes between elements
  node = node.next_sibling
end

puts a.map{ |n| n.to_s }
#=> <p>lorem</p>
#=> <p>ipsum</p>
#=> <p>etc</p>

However, this is NOT faster. In a few simple tests, I found that xpath-only (the first solution) is about 2x as fast as this looping test, even when there are a very large number of paragraphs after the stop node. When there are many nodes to capture (and few after the stop) it performs even better, in the 6x-10x range.

Phrogz