
I have a string in XML, <italic>a</italic>, and I am using xsl:analyze-string to extract all italic words matching this pattern: "<italic>a</italic>". I know I can use a template match on italic, but the requirement here is to match it using a regex. I am trying to write the expression like this: (<italic>)[a-z]+</italic>, but the XSLT processor throws an error on the opening < character.

Any idea how to handle opening and closing tags in regex?

stealthyninja
atif

3 Answers


You haven't said what your XML source looks like, but if <italic>a</italic> is an ordinary XML element, then you can't match the lexical form of the element using regular expressions. That's because the input to XSLT is a tree of nodes, not a string of lexical XML markup. That concept is absolutely crucial to understanding how XSLT works.
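
A minimal sketch of what working on the node tree can look like while still using a regular expression: select the italic element nodes with XPath and test their string value with the XPath 2.0 matches() function. The input shape (italic elements anywhere in the document) and the wrapper element names results and result are assumptions made for illustration.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <!-- "results" and "result" are hypothetical wrapper names -->
    <results>
      <!-- Select italic element nodes, then apply the regex to their
           string value (not to the markup, which does not exist in the tree) -->
      <xsl:for-each select="//italic[matches(., '^[a-z]+$')]">
        <result>
          <xsl:value-of select="."/>
        </result>
      </xsl:for-each>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

Here the regex only ever sees the text inside each element, so no angle brackets need to be matched or escaped.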

Michael Kay
  • <italic>a</italic> is an ordinary XML element, not a string. I found a way in the Saxon XSLT processor: use the net.sf.saxon.serialize function to serialize the XML and then apply the regular expression. It works great. – atif Apr 12 '12 at 07:21

As long as <italic>a</italic> is an actual string, you can use &lt; for the < character. The greater-than (>) does not need to be escaped.

Example:

Sample XML Input

<test><![CDATA[<italic>a</italic>]]></test>

XSLT 2.0

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="/">
    <xsl:analyze-string select="test" regex="&lt;italic>([^&lt;]+)&lt;/italic>">
      <xsl:matching-substring>
        <results>
          <xsl:value-of select="regex-group(1)"/>
        </results>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>

XML Output:

<results>a</results>
Daniel Haley
  • The code might be better to read if you put the pattern in a CDATA itself with e.g. `<![CDATA[([^<])]]>`, then use that with `` – Martin Honnen Apr 02 '12 at 17:21
  • I agree with DevNull, but there is a slight error in your regex. IMHO the correct regex is: regex="<italic>([^<]+)</italic>" The extra plus is because the captured mark-up (if I have understood correctly) can be more than one character. The question specifies "italic words" which implies multiple characters – Sean B. Durkin Apr 03 '12 at 05:17
  • Unfortunately it's not a string, it's an actual element. I found a way in the Saxon XSLT processor: use the net.sf.saxon.serialize function to serialize the XML and then apply the regular expression. – atif Apr 12 '12 at 07:22

<italic>a</italic> is an ordinary XML element. If you are using the Saxon XSLT processor, use the extension function net.sf.saxon.serialize to serialize the XML, and then apply the regular expression to the resulting string. It works great.
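
A rough sketch of the serialize-then-match idea. Note the substitution: instead of the Saxon extension named above, this uses the standard serialize() function from XPath 3.0, so it assumes an XSLT 3.0 processor. The wrapper element names results and result are hypothetical, and the document element is assumed to contain the italic elements.

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <!-- "results" and "result" are hypothetical wrapper names -->
    <results>
      <!-- serialize(/*) turns the document element back into lexical XML,
           so the regex can now see the tags; < must be written as &lt; -->
      <xsl:analyze-string select="serialize(/*)"
                          regex="&lt;italic>([a-z]+)&lt;/italic>">
        <xsl:matching-substring>
          <result>
            <xsl:value-of select="regex-group(1)"/>
          </result>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

The exact serialized form (attribute order, namespace declarations, whitespace) is up to the serializer, so a regex written this way is more fragile than matching the nodes directly.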

atif