
I am trying to split a large XML file (around 10 GB) into smaller XML files using XSLT streaming.

The XML looks like:

<?xml version="1.0" encoding="UTF-8"?>
<Book>
  <Header>...</Header>
  <Entry>...</Entry>
  <Entry>...</Entry>
  <Entry>...</Entry>
  <Entry>...</Entry>
</Book>

The XSL looks like:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:mode streamable="yes" on-no-match="shallow-copy"/>

    <xsl:template match="/">
        <xsl:apply-templates select="Book">
            <xsl:with-param name="header" select="/Book/Header"/>
            <xsl:with-param name="top-level-element" select="name(/*[1])"/>
        </xsl:apply-templates>
    </xsl:template>

    <xsl:template match="Book">
        <xsl:param name="top-level-element"/>
        <xsl:param name="header"/>
        <xsl:result-document href="{concat(position(),'.xml')}" method="xml">
            <xsl:element name="{$top-level-element}">
                <xsl:value-of select="$header"/>
                <xsl:iterate
                        select="Entry">
                    <xsl:apply-templates select="."/>
                </xsl:iterate>
            </xsl:element>
        </xsl:result-document>
    </xsl:template>

    <xsl:template match="Entry">
        <xsl:copy-of select="."/>
    </xsl:template>

</xsl:stylesheet>

When I run the transformation against the XML, I get the following error:

Error on line 6 column 29 
  XTSE3430  Template rule is not streamable
  * Operand {Book/Header} of {xsl:apply-templates} selects streamed nodes in a
  context that allows arbitrary navigation (line 8)
  * The result of the template rule can contain streamed nodes

Can someone help me see what I'm doing wrong?

Martin Honnen
Mark
  • Is that Saxon EE? Then use a capturing accumulator to store the header. – Martin Honnen Apr 05 '23 at 16:03
  • See the section in https://www.saxonica.com/html/documentation12/streaming/xslt-streaming.html starting with "Saxon (from 9.9) supports an additional capability: capturing accumulators..". – Martin Honnen Apr 05 '23 at 16:13
  • From that code, it is also rather unclear how you want to split the input document; you match on the `Book` (which is the root element in your sample) and have a single `xsl:result-document` there. – Martin Honnen Apr 05 '23 at 16:28
  • Hello Martin, yes, I am using Saxon EE, and according to the documentation the accumulator should be the solution. – Mark Apr 05 '23 at 18:37
  • I added the following to the XSL: and in the Entry template: But it results in an error: Error in {accumulator('header')} at char 0 in xsl:variable/@select on line 20 column 69 XPST0017: Cannot find a 1-argument function named Q{http://www.w3.org/2005/xpath-functions}accumulator() What am I doing wrong? – Mark Apr 05 '23 at 18:43
  • The error message states you are trying to call a function named `accumulator`, not the one called `accumulator-before`. – Martin Honnen Apr 05 '23 at 19:11

2 Answers


As a general rule, you can't store streamed nodes in a variable, because that would enable you to refer to the node later, when the stream has moved on, and the whole point of streaming is that when a node has gone, it has gone. And this is particularly true for template parameters: the analysis of each template rule is done independently, and if a streamed node could be passed in a parameter, there would be no way of knowing that it is being processed in a streamable way.

If this were the only problem, you could solve it by making a copy of the Header element and passing that instead: select="copy-of(Header)". But there's another problem, which is that your template rule is making multiple downward selections. You simply can't select downwards to /Book, then to /Book/Header, then to /*[1].

But in fact you don't need to. You've only got one Book, so you could scrap the match="/" template, and start with

<xsl:template match="Book">

with no parameters. Or if you don't know that it will always be a book, make it match="/*".
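
To make these two points concrete (a single template on the document element, and a grounded copy rather than a reference to a streamed node), here is a minimal, untested sketch that only extracts the header. The variable name `header` is chosen here for illustration, and the input is assumed to be in no namespace, as in the sample.

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:mode streamable="yes"/>

    <!-- Match the document element directly: no separate root template, no parameters -->
    <xsl:template match="/*">
        <!-- copy-of() builds an in-memory (grounded) copy, so the variable
             does not hold a reference to a node in the stream -->
        <xsl:variable name="header" select="copy-of(Header)"/>
        <xsl:copy-of select="$header"/>
    </xsl:template>
</xsl:stylesheet>
```

This works because the template makes only one downward selection. Adding a second one in the same template, such as selecting the Entry children as well, would make it non-streamable again, so the header has to be remembered in some other way.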

With these details sorted, the only real challenge is how to "remember" the header so that you can use it repeatedly while processing the entries. Martin has suggested a way of doing this using accumulators. I think my own approach might be to use xsl:iterate, like this:

<xsl:template match="/*">
  <xsl:variable name="top-level-element" select="name(.)"/>
  <xsl:iterate select="*">
    <xsl:param name="header" select="()"/>
    <xsl:choose>
       <xsl:when test="self::Header">
          <!-- remember a grounded copy of the header for the later iterations -->
          <xsl:next-iteration>
             <xsl:with-param name="header" select="copy-of(.)"/>
          </xsl:next-iteration>
       </xsl:when>
       <xsl:otherwise>
          <xsl:result-document>
           ....
              <xsl:copy-of select="$header"/>
           ....
          </xsl:result-document>
       </xsl:otherwise>
    </xsl:choose>
  </xsl:iterate>
</xsl:template>

Both xsl:iterate and accumulators were expressly designed to provide a way of "remembering" selected data while processing forwards through the stream.
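
With the elided parts of the outline above filled in, a complete stylesheet for the xsl:iterate approach might look like the following untested sketch. It assumes one output file per Entry and input elements in no namespace. The `count` parameter and the `Entry-N.xml` file name pattern are additions made here for illustration, not part of the outline.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all">

  <xsl:mode streamable="yes"/>

  <xsl:template match="/*">
    <xsl:variable name="top-level-element" select="name(.)"/>
    <xsl:iterate select="*">
      <!-- grounded copy of the Header, empty until the Header has been seen -->
      <xsl:param name="header" as="element()?" select="()"/>
      <!-- hypothetical counter used only to build the output file names -->
      <xsl:param name="count" as="xs:integer" select="1"/>
      <xsl:choose>
        <xsl:when test="self::Header">
          <xsl:next-iteration>
            <xsl:with-param name="header" select="copy-of(.)"/>
          </xsl:next-iteration>
        </xsl:when>
        <xsl:otherwise>
          <!-- every other child (the Entry elements) gets its own file -->
          <xsl:result-document href="Entry-{$count}.xml">
            <xsl:element name="{$top-level-element}">
              <xsl:copy-of select="$header"/>
              <xsl:copy-of select="."/>
            </xsl:element>
          </xsl:result-document>
          <!-- parameters not listed here (header) keep their current value -->
          <xsl:next-iteration>
            <xsl:with-param name="count" select="$count + 1"/>
          </xsl:next-iteration>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:iterate>
  </xsl:template>

</xsl:stylesheet>
```

Each pass through the loop reads the current child only once: either to copy the Header into the parameter or to copy the Entry into its result document. That single read per child is what should keep the template streamable.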

Michael Kay

Here is an example that outputs a new file for each Entry in the input document, copying the Header that the accumulator has captured:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all"
  xmlns:saxon="http://saxon.sf.net/">

  <xsl:mode on-no-match="shallow-skip" streamable="yes" use-accumulators="#all"/>
  
  <xsl:output indent="yes"/>

  <xsl:accumulator name="header" as="element(Header)?"  streamable="yes" initial-value="()">
    <xsl:accumulator-rule match="Header" phase="end" saxon:capture="yes" select="." />
  </xsl:accumulator>
  
  <xsl:accumulator name="Entry-count" as="xs:integer" streamable="yes" initial-value="0">
    <xsl:accumulator-rule match="Entry" select="$value + 1"/>
  </xsl:accumulator>
  
  <xsl:template match="Entry">
    <xsl:result-document href="Entry-{accumulator-before('Entry-count')}.xml">
      <xsl:element name="{name(ancestor::*[last()])}" namespace="{namespace-uri(ancestor::*[last()])}">
        <xsl:copy-of select="accumulator-before('header')"/>
        <xsl:copy-of select="."/>
      </xsl:element>
    </xsl:result-document>
  </xsl:template>
  
</xsl:stylesheet>

If you don't want to split on each Entry, then, assuming you want to store a certain number of adjacent Entry elements in a result document, you can use positional grouping rather easily with streaming:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all"
  xmlns:saxon="http://saxon.sf.net/">

  <xsl:param name="chunk-size" as="xs:integer" select="5"/>

  <xsl:mode on-no-match="shallow-skip" streamable="yes" use-accumulators="#all"/>
  
  <xsl:output indent="yes"/>

  <xsl:accumulator name="header" as="element(Header)?"  streamable="yes" initial-value="()">
    <xsl:accumulator-rule match="Header" phase="end" saxon:capture="yes" select="." />
  </xsl:accumulator>
  
  <xsl:template match="/*">
    <xsl:for-each-group select="Entry" group-adjacent="(position() - 1) idiv $chunk-size">
      <xsl:result-document href="chunk-{position()}.xml">
        <xsl:element name="{name(ancestor::*[last()])}" namespace="{namespace-uri(ancestor::*[last()])}">
          <xsl:copy-of select="accumulator-before('header')"/>
          <xsl:copy-of select="current-group()"/>
        </xsl:element>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>
  
</xsl:stylesheet>
Martin Honnen