how to handle "<" and ">" in regex in xslt

Question

I have a string in XML, <italic>a</italic>, and I am using xsl:analyze-string to extract all italic words with this pattern: "<italic>a</italic>". I know I could use a template match on italic, but the requirement here is to match it using a regex. I am trying to write the expression like this: (<italic>)[a-z]+</italic>, but the XSLT processor throws an error on the opening < character.

Any idea how to handle opening and closing tags in regex?

Is `<italic>a</italic>` in CDATA or otherwise escaped (`&lt;`/`&gt;`)? Are you sure the processor sees it as a string? – Daniel Haley, Apr 02 '12 at 15:36
I tried converting it into (\<)(italic)(\>))[a-z+](\<)(italic)(\/)(\>)) but got the same result: the XSLT processor throws an error. – atif, Apr 02 '12 at 15:44

score 3 · Answer 1 · answered Apr 02 '12 at 17:51

You haven't said what your XML source looks like, but if <italic>a</italic> is an ordinary XML element, then you can't match the lexical form of the element using regular expressions. That's because the input to XSLT is a tree of nodes, not a string of lexical XML markup. That concept is absolutely crucial to understanding how XSLT works.
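
To illustrate the tree model, here is a minimal sketch that selects the italic elements as nodes and applies a regular expression only to their text content. It assumes a hypothetical input such as <test><italic>a</italic></test>; the wrapper names results and result are illustrative and not taken from the question.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <results>
      <!-- italic is selected as an element node; the regex only tests its
           string value, which contains no angle brackets at all -->
      <xsl:for-each select="//italic[matches(., '^[a-z]+$')]">
        <result>
          <xsl:value-of select="."/>
        </result>
      </xsl:for-each>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

The regex here never sees "<italic>" because, by the time the stylesheet runs, the markup has already been parsed into element and text nodes.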

Michael Kay

<italic>a</italic> is an ordinary XML element, not a string. I found a way in the Saxon XSLT processor: use the net.sf.saxon.serialize function to serialize the XML and then apply the regular expression. It works great. – atif Apr 12 '12 at 07:21

Daniel Haley · Answer 2 · 2012-04-03T06:47:15.763

1

As long as <italic>a</italic> is an actual string, you can use &lt; for the < character. The greater-than sign (>) does not need to be escaped.

Example:

Sample XML Input

<test><![CDATA[<italic>a</italic>]]></test>

XSLT 2.0

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="/">
    <xsl:analyze-string select="test" regex="&lt;italic>([^&lt;]+)&lt;/italic>">
      <xsl:matching-substring>
        <results>
          <xsl:value-of select="regex-group(1)"/>
        </results>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>

XML Output:

<results>a</results>
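
As a variation, the pattern can be kept free of &lt; escapes by storing it in a variable inside a CDATA section and referencing it through an attribute value template. This is a sketch under the same assumption as above (the text of test is a string containing the markup characters); the variable name pattern is illustrative.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>

  <!-- CDATA lets the pattern be written with literal < characters -->
  <xsl:variable name="pattern"><![CDATA[<italic>([^<]+)</italic>]]></xsl:variable>

  <xsl:template match="/">
    <!-- the regex attribute is an attribute value template, so {$pattern}
         is replaced by the variable's string value -->
    <xsl:analyze-string select="test" regex="{$pattern}">
      <xsl:matching-substring>
        <results>
          <xsl:value-of select="regex-group(1)"/>
        </results>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the pattern in a variable also avoids having to double any curly braces in the regex, which would otherwise be required when they are written directly in the regex attribute.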

edited Apr 03 '12 at 06:47

answered Apr 02 '12 at 16:00

Daniel Haley

The code might be better to read if you put the pattern in a CDATA itself with e.g. `<![CDATA[([^<])]]>`, then use that with `` – Martin Honnen Apr 02 '12 at 17:21
I agree with DevNull, but there is a slight error in your regex. IMHO the correct regex is: regex="&lt;italic>([^&lt;]+)&lt;/italic>". The extra plus is needed because the captured content (if I have understood correctly) can be more than one character. The question specifies "italic words", which implies multiple characters. – Sean B. Durkin Apr 03 '12 at 05:17
Unfortunately, it's not a string; it's an actual element. I found a way in the Saxon XSLT processor: use the net.sf.saxon.serialize function to serialize the XML and then apply the regular expression. – atif Apr 12 '12 at 07:22

score 0 · Accepted Answer · answered Apr 12 '12 at 07:25

<italic>a</italic> is an ordinary XML element. If you are using the Saxon XSLT processor, you can use the extension function net.sf.saxon.serialize to serialize the XML and then apply the regular expression to the resulting string. It works great.
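
The exact call to the Saxon extension is not shown here, so the sketch below swaps in a different technique: the standard XPath 3.0 serialize() function, which requires an XSLT 3.0 processor. It assumes a hypothetical input such as <test><italic>a</italic></test> and illustrates the serialize-then-match idea only.

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <results>
      <!-- serialize(/*) turns the document element back into a string of
           lexical XML, which a regex can then be applied to -->
      <xsl:analyze-string select="serialize(/*)"
                          regex="&lt;italic>([^&lt;]+)&lt;/italic>">
        <xsl:matching-substring>
          <result>
            <xsl:value-of select="regex-group(1)"/>
          </result>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

Note that this pattern is fragile: it will miss italic elements that carry attributes or namespace declarations in their serialized form, so matching on the node tree remains the more robust approach where the requirements allow it.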

atif
