In the XSLT 1.0 days, a common forum question was how to convert flat HTML into hierarchical XML, which often boiled down to wrapping the text between <br /> tags in <p> tags.

I have a similar problem, which I think I've partially solved using XSLT 2.0, but it's a new approach to me and I'd like to get a second opinion.

The XHTML source has <span class="pageStart"></span> markers scattered throughout. They can appear in several different parent nodes. I want to wrap all the nodes between one page start marker and the next in a <page> node. The solution I currently have is:

<xsl:template match="*[child::span[@class='pageStart']]">
  <xsl:copy>
    <xsl:copy-of select="@*" />
    <xsl:for-each-group select="node()"
                        group-starting-with="span[@class='pageStart']">
      <page>
        <xsl:apply-templates select="current-group()"/>
      </page>
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>

There's at least one flaw with this: the parent node of the marker gets a <page> as a child node where I don't want one. In other words, if there's a <div> that has a child page marker anywhere in it, a <page> node is created as an immediate child of the <div> in addition to the locations I expect.

I had hoped that I could simply make the template rule <xsl:template match="span[@class='pageStart']">, but current-group() seems to be empty no matter what I try. The common-sense approach I tried was <xsl:for-each-group select="node()" group-starting-with="span[@class='pageStart']">.

Is there an easier way to solve this problem that I'm missing?

EDIT

Here's an example of the input:

<?xml version="1.0" encoding="UTF-8"?>
<html>
<head></head>
<body>
    <span class="pageStart"/>
    <p>...</p>
    <div>...</div>
    <img />
    <p></p>
    <span class="pageStart"/>
    <div>...</div>
    <span class="pageStart"/>
    <p>...</p>
    <div>
        <span class="pageStart"/>
        <p>...</p>
        <p>...</p>
        <span class="pageStart"/>
        <div>...</div>
        <img/>
    </div>
</body>
</html>

I assume the last two nested pages make this problem more difficult, so I'd be perfectly happy getting this as the output, or something close:

<?xml version="1.0" encoding="UTF-8"?>
<html>
<head></head>
<body>
    <page>
        <span class="pageStart"/>
        <p>...</p>
        <div>...</div>
        <img />
        <p></p>
    </page>
    <page>
        <span class="pageStart"/>
        <div>...</div>
    </page>
    <page>
        <span class="pageStart"/>
        <p>...</p>
        <div>
            <page>
                <span class="pageStart"/>
                <p>...</p>
                <p>...</p>
            </page>
            <page>
                <span class="pageStart"/>
                <div>...</div>
                <img/>
            </page>
        </div>
    </page>
</body>
</html>
Mattio
  • It would be A LOT easier to decipher what you're asking for if you included some sample input and output XML. – Jim Garrison Mar 24 '11 at 02:34
  • That rule plus an identity rule will produce the exact output. What's the question? –  Mar 27 '11 at 16:49
  • Good question, +1. See my answer for a complete, short and easy solution. :) – Dimitre Novatchev Mar 27 '11 at 18:37
  • @Alejandro: I actually went back and forth whether or not to post this here or on codereview.stackexchange.com. I decided here because of the one flaw I mentioned. I'm trying out Dimitre's solution now. – Mattio Mar 29 '11 at 20:33

1 Answer

This transformation:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="*[span/@class='pageStart']">
  <xsl:copy>
   <xsl:copy-of select="@*"/>
   <xsl:for-each-group select="node()"
       group-starting-with="span[@class='pageStart']">
     <page>
      <xsl:apply-templates select="current-group()"/>
     </page>
   </xsl:for-each-group>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>

when applied on the provided XML document:

<html>
<head></head>
<body>
    <span class="pageStart"/>
    <p>...</p>
    <div>...</div>
    <img />
    <p></p>
    <span class="pageStart"/>
    <div>...</div>
    <span class="pageStart"/>
    <p>...</p>
    <div>
        <span class="pageStart"/>
        <p>...</p>
        <p>...</p>
        <span class="pageStart"/>
        <div>...</div>
        <img/>
    </div>
</body>
</html>

produces the wanted, correct result:

<html>
   <head/>
   <body>
      <page>
         <span class="pageStart"/>
         <p>...</p>
         <div>...</div>
         <img/>
         <p/>
      </page>
      <page>
         <span class="pageStart"/>
         <div>...</div>
      </page>
      <page>
         <span class="pageStart"/>
         <p>...</p>
         <div>
            <page>
               <span class="pageStart"/>
               <p>...</p>
               <p>...</p>
            </page>
            <page>
               <span class="pageStart"/>
               <div>...</div>
               <img/>
            </page>
         </div>
      </page>
   </body>
</html>
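
One plausible reading of the flaw described in the question is that group-starting-with puts any nodes that precede the first marker into a group of their own, and the template then wraps that leading group in a <page> as well. The following is a minimal sketch of a variant that guards against this, assuming content before the first marker should be copied without a wrapper. The xsl:choose test is my own addition for illustration, not part of the stylesheet above.

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- Identity rule: copy everything that no other rule handles -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Any element that has a page start marker as a direct child -->
 <xsl:template match="*[span/@class='pageStart']">
  <xsl:copy>
   <xsl:copy-of select="@*"/>
   <xsl:for-each-group select="node()"
       group-starting-with="span[@class='pageStart']">
    <xsl:choose>
     <!-- The group really begins with a marker: wrap it in a page -->
     <xsl:when test="current-group()[1][self::span][@class='pageStart']">
      <page>
       <xsl:apply-templates select="current-group()"/>
      </page>
     </xsl:when>
     <!-- Leading group (nodes before the first marker): no wrapper -->
     <xsl:otherwise>
      <xsl:apply-templates select="current-group()"/>
     </xsl:otherwise>
    </xsl:choose>
   </xsl:for-each-group>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>
```

On the sample document every group begins with a marker once whitespace is stripped, so this variant should give the same result as above. The guard only makes a difference when a parent has real content before its first marker.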
Dimitre Novatchev
  • My sample is too simple for the problem. The page start markers could appear at the end of something like a deeply nested div, really requiring all open tags to be closed, then re-opened to start a page to wrap the content. But it's no longer an issue because I was able to get different source XML that does not allow pages to start in arbitrary locations. Thanks for your help! – Mattio Mar 29 '11 at 21:24