
I have XML like this:

<span>1</span>
<span class="x">2</span>
<span class="x y">3</span>
<span class="x">4</span>
<span>5</span>
<span class="x">6</span>
<span>7</span>
<span class="x">8</span>

What I want is to use an XSLT stylesheet to put the contents of all elements whose class attribute contains x into one <x> element. So the output should be like this:

1 <x>234</x> 5 <x>6</x> 7 <x>8</x>

(or, ideally,

1 <x>2<y>3</y>4</x> 5 <x>6</x> 7 <x>8</x>

but that's a problem to tackle when I've solved this one.)

This is the relevant fragment of my XSLT:

<xsl:template match="span[contains(@class,'x') and preceding-sibling::span[1][not(contains(@class,'x'))]]">
  <x><xsl:for-each select=". | following-sibling::span[contains(@class,'x')]">
    <xsl:value-of select="text()"/>
  </xsl:for-each></x>
</xsl:template>

<xsl:template match="span[contains(@class,'x') and preceding-sibling::span[1][contains(@class,'x')]]">
</xsl:template>

<xsl:template match="span">
  <xsl:value-of select="text()"/>
</xsl:template>

What this produces is:

1 <x>23468</x> 5 <x>68</x> 7 <x>8</x>

I'm pretty sure I have to use a count in the XPath expression so that it doesn't select all of the following elements with class x, just the contiguous ones. But how can I count the contiguous ones? Or am I doing this the wrong way?

ptomato

3 Answers


This is tricky, but doable (long read ahead, sorry for that).

The key to detecting "consecutiveness" with XPath axes (which select every matching node in a direction, not just the adjacent ones) is to check whether the nearest node in the opposite direction that "first fulfills the condition" is also the one that "started" the series at hand:

a
b  <- first node to fulfill the condition, starts series 1
b  <- series 1
b  <- series 1
a
b  <- first node to fulfill the condition, starts series 2
b  <- series 2
b  <- series 2
a

In your case, a series consists of <span> nodes that have the string x in their @class:

span[contains(concat(' ', @class, ' '),' x ')] 

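Note that I pad both the attribute value and the search token with spaces to avoid false positives: a plain contains(@class,'x') would also match class names such as xy or box.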
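To make the difference concrete, here is a small Python sketch of the same padding trick. It is only a model of the XPath test, not part of the stylesheet, and the helper name has_class is made up for illustration.

```python
def has_class(class_attr, token):
    # Mirrors contains(concat(' ', @class, ' '), ' x '):
    # pad both sides so that only whole space-separated tokens match.
    return f" {token} " in f" {class_attr} "

print(has_class("x y", "x"))  # True: x is one of the tokens
print(has_class("xy", "x"))   # False: xy is a different token
print("x" in "xy")            # True: the naive substring test is fooled
```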

A <span> that starts a series (i.e. one that "first fulfills the condition") can be defined as one that has an x in its class and is not directly preceded by another <span> that also has an x:

not(preceding-sibling::span[1][contains(concat(' ', @class, ' '),' x ')])

We must check this condition in an <xsl:if> so that the template does not generate output for nodes that are inside a series (i.e., the template does actual work only for "starter nodes").

Now to the tricky part.

From each of these "starter nodes" we must select all following-sibling::span nodes that have an x in their class. Also include the current span to account for series that only have one element. Okay, easy enough:

. | following-sibling::span[contains(concat(' ', @class, ' '),' x ')]

For each of these we now find out if their closest "starter node" is identical to the one that the template is working on (i.e. that started their series). This means:

  • they must be part of a series (i.e. they must follow a span with an x)

    preceding-sibling::span[1][contains(concat(' ', @class, ' '),' x ')]
    
  • now remove any span whose starter node is not identical to the current series starter. That means we check any preceding-sibling span (that has an x) which itself is not directly preceded by a span with an x:

    preceding-sibling::span[contains(concat(' ', @class, ' '),' x ')][
      not(preceding-sibling::span[1][contains(concat(' ', @class, ' '),' x ')])
    ][1]
    
  • Then we use generate-id() to check node identity. If the found node is identical to $starter, then the current span is one that belongs to the consecutive series.
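The three checks above can be modelled outside XSLT as well. The following Python sketch works on a plain list of class strings instead of sibling nodes, and uses list indices where the stylesheet uses generate-id(). The names has_x, nearest_starter and series_from are hypothetical and exist only for this illustration.

```python
def has_x(cls):
    # Same token test as contains(concat(' ', @class, ' '), ' x ')
    return " x " in f" {cls} "

def nearest_starter(classes, i):
    # Models: preceding-sibling::span[has x][not(preceding-sibling::span[1][has x])][1]
    for j in range(i - 1, -1, -1):
        if has_x(classes[j]) and (j == 0 or not has_x(classes[j - 1])):
            return j
    return None

def series_from(classes, start):
    # Models: . | following-sibling::span[has x][follows an x][same starter]
    members = [start]
    for i in range(start + 1, len(classes)):
        if (has_x(classes[i])
                and has_x(classes[i - 1])
                and nearest_starter(classes, i) == start):
            members.append(i)
    return members

classes = ["", "x", "x y", "x", "", "x", "", "x"]
print(series_from(classes, 1))  # [1, 2, 3]
print(series_from(classes, 5))  # [5]
```

Comparing indices here plays the role of comparing generate-id() values: both are just a way to ask "is this the same node?".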

Putting it all together:

<xsl:template match="span[contains(concat(' ', @class, ' '),' x ')]">
  <xsl:if test="not(preceding-sibling::span[1][contains(concat(' ', @class, ' '),' x ')])">
    <xsl:variable name="starter" select="." />
    <x>
      <xsl:for-each select="
        . | following-sibling::span[contains(concat(' ', @class, ' '),' x ')][
          preceding-sibling::span[1][contains(concat(' ', @class, ' '),' x ')]
          and
          generate-id($starter)
          =
          generate-id(
            preceding-sibling::span[contains(concat(' ', @class, ' '),' x ')][
              not(preceding-sibling::span[1][contains(concat(' ', @class, ' '),' x ')])
            ][1]
          )
        ]
      ">
        <xsl:value-of select="text()" />
      </xsl:for-each>
    </x>
  </xsl:if>
</xsl:template>

And yes, I know it's not pretty. There is an <xsl:key>-based solution that is more efficient; Dimitre's answer shows it.

With your sample input, this output is generated:

1
<x>234</x>
5
<x>6</x>
7
<x>8</x>
Tomalak
  • But you need to think about the "ideally" problem, too :) – Dimitre Novatchev Jan 22 '12 at 17:11
  • @Dimitre I thought about it and decided that it would not at all be a funny thing to implement (given that `class="w x y z"` would mean I'd have to implement a stack and a tokenizer and error-handling for mis-nested elements). I'm not even sure if my solution here is ideal (assuming a key should not be used). – Tomalak Jan 22 '12 at 17:20
  • Tomalak, no solution is "ideal" -- good, working solutions suffice. – Dimitre Novatchev Jan 22 '12 at 17:39
  • @Dimitre As long as you attest that my logic/code does not contain blunder and cannot be simplified, it's okay with me. It seems a bit repetitive, but as far as I can see, every predicate is necessary. – Tomalak Jan 22 '12 at 17:42

I. XSLT 1.0 Solution:

What I want is to use an XSLT stylesheet to put the contents of all elements whose class attribute contains x into one <x> element. So the output should be like this:

1 <x>234</x> 5 <x>6</x> 7 <x>8</x>

This transformation:

 <xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:key name="kFollowing" match=
  "span[contains(concat(' ', @class, ' '),
                 ' x ')
        ]"
   use="generate-id(preceding-sibling::span
                                    [not(contains(concat(' ', @class, ' '),
                                             ' x '))
                                    ][1]
                    )
        "/>

 <xsl:template match=
 "span[contains(concat(' ', @class, ' '), ' x ')
     and
       not(contains(concat(' ', preceding-sibling::span[1]/@class, ' '),
                    ' x '
                    )
           )
      ]"
  >
     <x>
       <xsl:apply-templates mode="inGroup" select=
       "key('kFollowing',
             generate-id(preceding-sibling::span
                            [not(contains(concat(' ', @class, ' '),
                                          ' x ')
                                 )
                            ][1]
                        )
            )
      "/>
     </x>
 </xsl:template>

 <xsl:template match=
 "span[contains(concat(' ', @class, ' '), ' x ')
     and
       contains(concat(' ', preceding-sibling::span[1]/@class, ' '),
                    ' x '
                    )
      ]
  "/>
</xsl:stylesheet>

when applied on the provided XML document (wrapped into a single top element html to be made well-formed):

<html>
    <span>1</span>
    <span class="x">2</span>
    <span class="x y">3</span>
    <span class="x">4</span>
    <span>5</span>
    <span class="x">6</span>
    <span>7</span>
    <span class="x">8</span>
</html>

produces the wanted, correct result:

1<x>234</x>5<x>6</x>7<x>8</x>
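For readers new to <xsl:key>, the idea behind kFollowing can be pictured as a lookup table: every span with class x is filed under its nearest preceding span without x. The Python sketch below is only a rough model of that table, using list indices in place of generate-id() values. The function name build_key is invented for the illustration.

```python
from collections import defaultdict

def has_x(cls):
    # Same token test as contains(concat(' ', @class, ' '), ' x ')
    return " x " in f" {cls} "

def build_key(classes):
    groups = defaultdict(list)
    last_non_x = None  # stands in for generate-id() of an empty node-set
    for i, cls in enumerate(classes):
        if has_x(cls):
            groups[last_non_x].append(i)  # file under nearest preceding non-x span
        else:
            last_non_x = i
    return dict(groups)

classes = ["", "x", "x y", "x", "", "x", "", "x"]
print(build_key(classes))  # {0: [1, 2, 3], 4: [5], 6: [7]}
```

Because all members of one consecutive run share the same nearest preceding non-x span, a single lookup from the first span of the run returns the whole run.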

Then the "ideally" addition:

(or, ideally,

1 <x>2<y>3</y>4</x> 5 <x>6</x> 7 <x>8</x> 

but that's a problem to tackle when I've solved this one.)

Just add to the above solution this template:

  <xsl:template mode="inGroup" match=
    "span[contains(concat(' ', @class, ' '),
                   ' y '
                   )
         ]">
    <y><xsl:value-of select="."/></y>
  </xsl:template>

When the solution modified in this way is applied to the same XML document, the (new) wanted result is again produced:

1<x>2<y>3</y>4</x>5<x>6</x>7<x>8</x>

II. XSLT 2.0 Solution:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:my="my:my" exclude-result-prefixes="my xs"
>
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/*">
         <xsl:for-each-group select="span" group-adjacent=
          "contains(concat(' ',@class,' '), ' x ')">

           <xsl:sequence select=
           "if(current-grouping-key())
              then
                my:formatGroup(current-group())
              else
                data(current-group())
           "/>
         </xsl:for-each-group>
 </xsl:template>

 <xsl:function name="my:formatGroup" as="node()*">
  <xsl:param name="pGroup" as="node()*"/>

  <x>
   <xsl:apply-templates select="$pGroup"/>
  </x>
 </xsl:function>

 <xsl:template match=
   "span[contains(concat(' ',@class, ' '), ' y ')]">
  <y><xsl:apply-templates/></y>
 </xsl:template>
</xsl:stylesheet>

When this XSLT 2.0 transformation is applied on the same XML document (above), the wanted "ideal" result is produced:

1<x>2<y>3</y>4</x>5<x>6</x>7<x>8</x>
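If group-adjacent is unfamiliar, it behaves much like Python's itertools.groupby: consecutive items with the same key value form one group. The sketch below is only an analogy on (class, text) pairs, not a translation of the stylesheet. The names has_class and render are hypothetical, and unlike data() in the stylesheet it joins the text of ungrouped spans without separators.

```python
from itertools import groupby

def has_class(class_attr, token):
    # Same token test as contains(concat(' ', @class, ' '), ' x ')
    return f" {token} " in f" {class_attr} "

def render(spans):
    out = []
    # groupby starts a new group whenever the key changes, like group-adjacent
    for in_x, group in groupby(spans, key=lambda s: has_class(s[0], "x")):
        if in_x:
            inner = "".join(
                f"<y>{text}</y>" if has_class(cls, "y") else text
                for cls, text in group
            )
            out.append(f"<x>{inner}</x>")
        else:
            out.extend(text for _, text in group)
    return "".join(out)

spans = [("", "1"), ("x", "2"), ("x y", "3"), ("x", "4"),
         ("", "5"), ("x", "6"), ("", "7"), ("x", "8")]
print(render(spans))  # 1<x>2<y>3</y>4</x>5<x>6</x>7<x>8</x>
```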
Dimitre Novatchev

Thanks for the solutions. In the meantime I've managed to put together something using a completely different tactic. I'm just learning XSLT for this project and the most helpful thing I've read is that XSLT is like functional programming. So I wrote something using recursion, after being pointed in the right direction by this:

<xsl:template match="span[
                       contains(@class,'x')
                       and
                       preceding-sibling::span[1][
                         not(contains(@class,'x'))
                       ]
                     ]">
  <x><xsl:value-of select="text()"/>
    <xsl:call-template name="continue">
      <xsl:with-param name="next" select="following-sibling::span[1]"/>
    </xsl:call-template>
  </x>
</xsl:template>

<xsl:template name="continue">
  <xsl:param name="next"/>
  <xsl:choose>
    <xsl:when test="$next[contains(@class,'x')]">
      <xsl:apply-templates mode="x" select="$next"/>
      <xsl:call-template name="continue">
        <xsl:with-param name="next" select="$next/following-sibling::span[1]"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise/><!-- Do nothing -->
  </xsl:choose>
</xsl:template>

<xsl:template match="span[
                       contains(@class,'x')
                       and
                       preceding-sibling::span[1][
                         contains(@class,'x')
                       ]
                     ]"/>

<xsl:template match="span">
  <xsl:value-of select="text()"/>
</xsl:template>

<xsl:template mode="x" match="span[contains(@class,'y')]">
  <y><xsl:value-of select="text()"/></y>
</xsl:template>

<xsl:template mode="x" match="span">
  <xsl:value-of select="text()"/>
</xsl:template>

I have no idea whether this is more or less efficient than doing it with generate-id() or keys, but I certainly learned something from your solutions!
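For comparison, the walk that the continue template performs (take the next sibling, stop at the first one without x) can be written as a plain loop. This Python sketch is an assumption-laden model on (class, text) pairs rather than a port of the stylesheet: it uses a whole-token class test instead of contains(@class,'x'), and the names has_class and collect_run are made up.

```python
def has_class(class_attr, token):
    # Whole-token test; the stylesheet above uses a plain substring test instead
    return f" {token} " in f" {class_attr} "

def collect_run(spans, start):
    # Walk forward from the starter while each span still has class x,
    # which is what the recursive "continue" template does one call at a time.
    texts = []
    i = start
    while i < len(spans) and has_class(spans[i][0], "x"):
        texts.append(spans[i][1])
        i += 1
    return texts

spans = [("", "1"), ("x", "2"), ("x y", "3"), ("x", "4"),
         ("", "5"), ("x", "6"), ("", "7"), ("x", "8")]
print(collect_run(spans, 1))  # ['2', '3', '4']
print(collect_run(spans, 5))  # ['6']
```

Each span is visited once per run, so the work is linear in the length of the run. The recursion depth in the XSLT version grows with the run length in the same way.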

ptomato
  • ptomato: There is nothing wrong with a recursive solution, except that with large enough input the recursion often results in stack overflow and/or generally less-than-optimal efficiency, unless it is specially coded using DVC (divide and conquer) or tail recursion, the latter not being recognized by all XSLT processors. – Dimitre Novatchev Jan 22 '12 at 20:01