5

Given markup like:

<p>
  <code>foo</code><code>bar</code>
  <code>jim</code> and then <code>jam</code>
</p>

I need to select the first three <code> elements, but not the last. The logic is: "Select all code elements that have a preceding or following sibling element that is also a code, unless there are one or more text nodes with non-whitespace content between them."

Given that I am using Nokogiri (which uses libxml2) I can only use XPath 1.0 expressions.

Although a tricky XPath expression is desired, Ruby code/iterations to perform the same on a Nokogiri document are also acceptable.

Note that the CSS adjacent sibling selector ignores non-element nodes, and so selecting nokodoc.css('code + code') will incorrectly select the last <code> block.

Nokogiri.XML('<r><a/><b/> and <c/></r>').css('* + *').map(&:name)
#=> ["b", "c"]

Edit: More test cases, for clarity:

<section><ul>
  <li>Go to <code>N</code> and
      then <code>Y</code><code>Y</code><code>Y</code>.
  </li>
  <li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>

All the Y above should be selected. None of the N should be selected. The contents of the <code> elements are used only to indicate which should be selected: you may not use the content to determine whether or not to select an element.

The context elements in which the <code> appear are irrelevant. They may appear in <li>, they may appear in <p>, they may appear in something else.

I want to select all the consecutive runs of <code> at once. It is not a mistake that there is a space character in the middle of one of the sets of Y.

Phrogz
  • The "non-whitespace" content makes this pretty tricky in xpath – MattH Jun 25 '12 at 22:43
  • @MattH I'd imagine. I could _almost_ accept a version that prohibited any intervening non-element nodes, but I _believe_ that I have seen some cases of a single space between them when I need it to match. – Phrogz Jun 25 '12 at 22:44
  • Would a regex be fine in this case? – Jwosty Jun 25 '12 at 23:22
  • @Jwosty It would not; I have a Nokogiri DOM of the page already that I am manipulating. Roundtripping through a `to_s` and re-parsing just to [use a regex to manipulate HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) would force me to vomit more than a little. ;) – Phrogz Jun 26 '12 at 05:05
  • Ah, okay. I guess that would be a bit complicated to do so... :P – Jwosty Jun 26 '12 at 13:16

3 Answers

4

Use:

//code
     [preceding-sibling::node()[1][self::code]
    or
      preceding-sibling::node()[1]
         [self::text()[not(normalize-space())]]
     and
      preceding-sibling::node()[2][self::code]
    or
     following-sibling::node()[1][self::code]
    or
      following-sibling::node()[1]
         [self::text()[not(normalize-space())]]
     and
      following-sibling::node()[2][self::code]
     ]

XSLT-based verification:

<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>

     <xsl:template match="/">
      <xsl:copy-of select=
       "//code
             [preceding-sibling::node()[1][self::code]
            or
              preceding-sibling::node()[1]
                 [self::text()[not(normalize-space())]]
             and
              preceding-sibling::node()[2][self::code]
            or
             following-sibling::node()[1][self::code]
            or
              following-sibling::node()[1]
                 [self::text()[not(normalize-space())]]
             and
              following-sibling::node()[2][self::code]
             ]"/>
     </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the provided XML document:

<section><ul>
      <li>Go to <code>N</code> and
          then <code>Y</code><code>Y</code><code>Y</code>.
      </li>
      <li>If you see <code>N</code> or <code>N</code> then…</li>
    </ul>
    <p>Elsewhere there might be: <code>N</code></p>
    <p><code>N</code> across parents.</p>
    <p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
    <p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>

the contained XPath expression is evaluated and the selected nodes are copied to the output:

<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
Dimitre Novatchev
  • Out of curiosity, what is the point/benefit of `/*/code` vs. `//code`? – Phrogz Jun 26 '12 at 04:11
  • @Phrogz: Many XPath 1.0 implementations are very slow in evaluating `//someName` -- they traverse the whole subtree. When we know the structure of the document, we can specify the exact path to the wanted elements and this may be many times faster. – Dimitre Novatchev Jun 26 '12 at 04:34
  • @Phrogz: I edited my answer and now the expression is much simpler. – Dimitre Novatchev Jun 26 '12 at 04:35
  • This will also incorrectly select


    oops

    .
    – pguardiario Jun 26 '12 at 05:06
  • @Phrogz: The question isn't specified precisely and this causes confusion. It is possible to have more than one group of "adjacent" `code` elements as per your explanation. Do you want the `code` elements from *all* groups to be selected? – Dimitre Novatchev Jun 26 '12 at 05:24
  • @pguardiario: Yes, there are different understandings possible of the question, as specified at present -- see my previous comment. The previous version of my answer was very different from this one. Later, I thought that the OP in fact wants something else -- hence the present version of the answer. Once Phrogz dispels this ambiguity, I will be glad to provide a correct answer. – Dimitre Novatchev Jun 26 '12 at 05:28
  • @Dimitre Yes: I want to find all adjacent-sibling `<code>` everywhere in the document, not just the first consecutive run or _nth_ consecutive. I will edit the question with more test cases. – Phrogz Jun 26 '12 at 12:58
  • @Phrogz: OK, then my first proposed solution selects these -- I'll revert to it. – Dimitre Novatchev Jun 26 '12 at 13:18
  • @Phrogz: See my last edit -- the XPath expression selects exactly the wanted `code` elements. In the process of the rollbacks I lost one upvote :) – Dimitre Novatchev Jun 26 '12 at 13:29
  • Dimitre gets the accept for the XSLT verification and the slightly cleaner XPath. Thank you @matt, too, for your good work. – Phrogz Jun 26 '12 at 16:37
  • @Phrogz: You are welcome. Keep asking very interesting and challenging XPath questions. – Dimitre Novatchev Jun 26 '12 at 16:41
3
//code[
  (
    following-sibling::node()[1][self::code]
    or (
      following-sibling::node()[1][self::text() and normalize-space() = ""]
      and
      following-sibling::node()[2][self::code]
    )
  )
  or (
    preceding-sibling::node()[1][self::code]
    or (
      preceding-sibling::node()[1][self::text() and normalize-space() = ""]
      and
      preceding-sibling::node()[2][self::code]
    )
  )
]

I think this does what you want, though I won’t claim you’d actually want to use it.

I’m assuming text nodes are always merged together so that there won’t be two adjacent to each other, which I believe is generally the case, but might not be if you’re doing DOM manipulations beforehand. I’ve also assumed that there won’t be any other elements between code elements, or that if there are they prevent selection like non-whitespace text.

matt
  • 1UP: Agrees with what I think the OP wants and a lot cleaner than the xpath I'd been hacking at. – MattH Jun 26 '12 at 08:01
1

I think this is what you want:

/p/code[not(preceding-sibling::text()[not(normalize-space(.)="")])]
pguardiario