XSLT 2.0: Create RegEx to enumerate chapter numbers and description from continuous text nodes

Question

I would like to extract chapter numbers, their titles, and their descriptions from an XML file into an XML element/attribute hierarchy. They are spread across continuous text in different elements. The XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <cell>3.1.1.17 First Section The “First appropriate” section lists things that can occur when an event happens. All of these event conditions result in an error.
  </cell>
  <cell>3.1.1.18 Second Section This section lists things that occur under certain conditions. 3.1.1.19 Third Section This section lists events that occur within a specific space. 3.2 SPACE chapter provides descriptions other stuff. See also: Chapter 4, “Other Stuff Reference” in the Manual.
  </cell>
</root>

The desired output should look like this:

<?xml version="1.0" encoding="utf-8"?>
<Root>
   <Desc chapter="3.1.1.17" title="First Section">The “First appropriate” section lists things that can occur when an event happens. All of these event conditions result in an error.</Desc>
   <Desc chapter="3.1.1.18" title="Second Section">This section lists things that occur under certain conditions.</Desc>
   <Desc chapter="3.1.1.19" title="Third Section">This section lists events that occur within a specific space. 3.2 SPACE chapter provides descriptions other stuff. See also: Chapter 4, “Other Stuff Reference” in the Manual.</Desc>
</Root>

My XSLT so far is:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" method="xml" encoding="utf-8" />

  <xsl:template match="text()" />

  <xsl:template match="/root">
    <Root>
      <xsl:apply-templates select="cell" />
    </Root>
  </xsl:template>

  <xsl:template match="cell">
    <xsl:variable name="sections" as="element(Desc)*">
      <xsl:analyze-string regex="(\d+\.\d+\.\d+\.\d+)\s(.*?Section)(.*?)" select="text()">
        <xsl:matching-substring>
          <Desc chapter="{regex-group(1)}" title="{regex-group(2)}">
            <xsl:value-of select="regex-group(3)" />
          </Desc>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:for-each select="$sections">
      <xsl:copy-of select="." />
    </xsl:for-each>
  </xsl:template>  
</xsl:stylesheet>

The problem lies in the last part of the RegEx: (.*?), a non-greedy consuming expression. Unfortunately, I can't make it stop at the right position. I tried to use ?: and (?=...) to make it stop, without consuming, before the next \d+\.\d+\.\d+\.\d+, but it seems the RegEx syntax of XSLT 2.0 differs from other dialects.

How would I extract the relevant parts to conveniently process them in the for-each as regex-group(1..3)?

Additionally, I am looking for a reasonably complete reference of all RegEx tokens supported in XSLT 2.0.

score 1 · Accepted Answer · answered Mar 29 '16 at 10:07

It seems

<xsl:template match="cell">
    <!-- Temporary tree: one Desc per heading, one Value per stretch of text between headings -->
    <xsl:variable name="sections">
        <xsl:analyze-string regex="(\d+\.\d+\.\d+\.\d+)\s(.*?Section)" select=".">
            <xsl:matching-substring>
                <!-- Group 1 is the chapter number, group 2 the title; there is no group 3 -->
                <Desc chapter="{regex-group(1)}" title="{regex-group(2)}"/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <!-- Everything up to the next heading is the description -->
                <Value>
                    <xsl:value-of select="."/>
                </Value>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:variable>
    <xsl:for-each select="$sections/Desc">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <!-- normalize-space() trims the text and collapses runs of whitespace -->
            <xsl:value-of select="normalize-space(following-sibling::Value[1])"/>
        </xsl:copy>
    </xsl:for-each>
</xsl:template>

captures both the data you want to select (via xsl:matching-substring) and the trailing text (via xsl:non-matching-substring).
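
On why the original pattern could not work: as far as I know, the regex dialect of XSLT 2.0 is the XML Schema one plus a few XPath additions (reluctant quantifiers, `^`/`$` anchors, back-references). Lookahead such as `(?=...)` is not part of that syntax, and neither is `(?:...)` in 2.0. A trailing reluctant `(.*?)` with nothing after it is already satisfied by the empty string, which is why the description has to come from `xsl:non-matching-substring`. As a variation on the template above, here is a minimal sketch that pairs each heading with its text using `xsl:for-each-group` instead of the `following-sibling` axis. The element names `Head` and `Text` are placeholders chosen for this sketch, not anything from the question, and it has not been tested against every input shape.

```xslt
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml" encoding="utf-8"/>

  <xsl:template match="/root">
    <Root>
      <xsl:apply-templates select="cell"/>
    </Root>
  </xsl:template>

  <xsl:template match="cell">
    <!-- Temporary tree: Head (hypothetical name) for each heading,
         Text (hypothetical name) for each stretch of text between headings -->
    <xsl:variable name="parts">
      <xsl:analyze-string select="." regex="(\d+\.\d+\.\d+\.\d+)\s(.*?Section)">
        <xsl:matching-substring>
          <Head chapter="{regex-group(1)}" title="{regex-group(2)}"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <Text><xsl:value-of select="."/></Text>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>

    <!-- Every Head opens a new group; the Text items after it belong to that group -->
    <xsl:for-each-group select="$parts/*" group-starting-with="Head">
      <!-- A leading group without a Head (text before the first heading) is skipped -->
      <xsl:if test="self::Head">
        <Desc chapter="{@chapter}" title="{@title}">
          <xsl:value-of select="normalize-space(string-join(current-group()[self::Text], ''))"/>
        </Desc>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

The grouping keeps each description tied to its own heading: if two headings ever followed each other with no text in between, the first one would get an empty description rather than the text that belongs to the second.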

Thank you very much. Using `xsl:non-matching-substring` is a great idea. — zx485, Mar 29 '16 at 10:24

score -1 · Answer 2 · answered Mar 29 '16 at 10:54

Sorry that I have to reply in JS, but I trust you can figure out what's going on. Your regex-and-replace solution should look like this:

var xmlData = '<?xml version="1.0" encoding="utf-8"?>\n<root>\n  <cell>3.1.1.17 First Section The “First appropriate” section lists things that can occur when an event happens. All of these event conditions result in an error.\n  </cell>\n  <cell>3.1.1.18 Second Section This section lists things that occur under certain conditions. 3.1.1.19 Third Section This section lists events that occur within a specific space. 3.2 SPACE chapter provides descriptions other stuff. See also: Chapter 4, “Other Stuff Reference” in the Manual.\n  </cell>\n</root>',
        // Dots are escaped so that they match only a literal "." in the chapter number
        rex = /<cell>(?:\s*(\d+\.\d+\.\d+\.\d+)\s+(\w+)\s+Section)(.+)\n*\s*<\/cell>/gm,
        xml = xmlData.replace(rex,'<Desc chapter="$1" title="$2 Section">$3</Desc>');
console.log(xml);

This prints:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <Desc chapter="3.1.1.17" title="First Section"> The “First appropriate” section lists things that can occur when an event happens. All of these event conditions result in an error.</Desc>
  <Desc chapter="3.1.1.18" title="Second Section"> This section lists things that occur under certain conditions. 3.1.1.19 Third Section This section lists events that occur within a specific space. 3.2 SPACE chapter provides descriptions other stuff. See also: Chapter 4, “Other Stuff Reference” in the Manual.</Desc>
</root>

*Sorry that i have to reply in JS* No, you really don't *have to reply in JS*. If you're truly sorry, then don't reply in the first place (or delete your answer now). Parsing XML with regex is [terribly brittle and should not be encouraged](https://stackoverflow.com/q/6751105/290085). Answering XSLT questions by posting JS is unhelpful and poor form. ***Future readers: Don't do this.*** — kjhughes, Jan 15 '21 at 04:31
