Tokenize mixed content in XSLT

Question

I have an element <mixed> that contains mixed content. Is it possible to use XSLT (2.0) to wrap all “words” (delimited by the pattern \s+, for example) inside <mixed> in a <w> tag, descending into inline elements when necessary? For example, given the following input:

<mixed>
  One morning, when <a>Gregor Samsa</a>
  woke from troubled dreams, he found
  himself transformed in his bed into
  a <b><c>horrible vermin</c></b>.
</mixed>

I want something like the following output:

<mixed>
  <w>One</w> <w>morning,</w> <w>when</w> <a><w>Gregor</w> <w>Samsa</w></a>
  <w>woke</w> <w>from</w> <w>troubled</w> <w>dreams,</w> <w>he</w> <w>found</w>
  <w>himself</w> <w>transformed</w> <w>in</w> <w>his</w> <w>bed</w> <w>into</w>
  <w>a</w> <b><c><w>horrible</w></c></b> <w><b><c>vermin</c></b>.</w>
</mixed>

Dimitre Novatchev provided a template in an answer to this related question that goes much of the way to solving this, but does not satisfy the following requirements:

Inline elements that terminate within a “word” should be split so that a single <w> element contains the whole “word.” Otherwise there would be invalid XML, such as:
```
 <w>a</w> <w><c>horrible</w> <w>vermin</c>.</w>
```
However, this template detaches the punctuation . after vermin and produces:
```
 <w>a</w> <c><w>horrible</w> <w>vermin</w></c> <w>.</w>
```
(Edit: None of the current 3 answers satisfy this requirement.)
The split token must not be discarded. Consider the similar task of wrapping non-coefficient numbers in <sub> tags in the context of a chemical formula. For example, <reactants>2H2 + O2</reactants> becomes <reactants>2H<sub>2</sub> + O<sub>2</sub></reactants>. This is not possible using the tokenize function because it simply discards the separator. Instead we will probably have to fall back on analyze-string.
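
As a rough illustration of the "keep what you split on" requirement outside XSLT, here is a minimal Python sketch that wraps every regex match in a tag and copies everything between matches verbatim, which is roughly what analyze-string does with matching and non-matching substrings. The helper name `wrap_matches`, the `sub` tag name, and the lookbehind pattern for "non-coefficient number" are assumptions made for this example, and it works on plain strings only (no XML escaping).

```python
import re

def wrap_matches(text, pattern, tag):
    """Wrap each regex match in <tag>; text between matches is kept as is."""
    out = []
    pos = 0
    for m in re.finditer(pattern, text):
        out.append(text[pos:m.start()])              # non-matching part, untouched
        out.append(f"<{tag}>{m.group(0)}</{tag}>")   # matching part, wrapped
        pos = m.end()
    out.append(text[pos:])                           # trailing non-matching part
    return "".join(out)

# Assumed rule: a number counts as a subscript when it directly follows a letter.
print(wrap_matches("2H2 + O2", r"(?<=[A-Za-z])\d+", "sub"))
# The same helper wraps whitespace-delimited words without losing the whitespace.
print(wrap_matches("One  morning,", r"\S+", "w"))
```

Because the loop emits the gaps between matches as well as the matches, nothing is discarded, which is the property a plain tokenize call lacks.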

If not XSLT, what is the best method to do this?

Hi Michael, none of the posted answers have solved my question. What do I do then? I don't really have enough rep for bounties. — hftf, Apr 16 '16 at 06:44
IMHO, you need to better define the problem. No one can provide an algorithm to solve a problem if they don't know how to solve it manually using only paper and pencil - and a few clear rules. — michael.hor257k, Apr 16 '16 at 11:36

michael.hor257k · Answer 1 · 2016-04-01T12:40:13.143

1

AFAICT, this would provide the expected result in your example:

XSLT 2.0

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="no"/>
<xsl:strip-space elements="*"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<!-- for every text node inside <mixed>: copy whitespace runs as is, wrap everything else in <w> -->
<xsl:template match="text()[ancestor::mixed]">
    <xsl:analyze-string select="." regex="\s+">
        <xsl:matching-substring>
            <xsl:value-of select="." />
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <w>
                <xsl:value-of select="." />
            </w>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>

However, I did not understand your point regarding "Inline elements that terminate within a “word”". What would be the expected result when, for example, a part of a word is italicized?

edited Apr 01 '16 at 12:40

answered Apr 01 '16 at 11:43

michael.hor257k


I consider `vermin.` to be a single word because it doesn’t contain the separator pattern `\s+`. However the inline elements `<b>` and `<c>` are opened in a previous word but are closed in the middle of that word: `vermin</c></b>.` So in order to wrap only `vermin.` in one element, it must be that the inline elements `<b>` and `<c>`, that are part of several words, are split. An example of the expected result is already in my post: `<w>a</w> <b><c><w>horrible</w></c></b> <w><b><c>vermin</c></b>.</w>`. Please let me know if this makes sense. – hftf Apr 01 '16 at 11:54
If I understand correctly, the suggested solution does what you want in that regard - although for an entirely different reason, namely that each text node is processed individually. – michael.hor257k Apr 01 '16 at 12:03
I believe that your solution wraps spaces, not words, with `<w>` because the `regex` parameter in `analyze-string` is `\s+`. Even when changed to `\S+`, the output ends with `<w>vermin</w></c></b><w>.</w>`. – hftf Apr 01 '16 at 12:09
"*I believe that your solution wraps spaces, not words*" Right, fixed now. -- "*the output ends with...*" We need clearer rules here. Until then, if a period is a part of a word, then a single period is a one-letter word. If you want each text node to also consider the previous/next nodes, this will get **much** more complicated. – michael.hor257k Apr 01 '16 at 12:47
Indeed, it's a complicated problem. – hftf Apr 01 '16 at 12:51
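
For what it's worth, one way to pin the rules down is to treat a "word" as a run of non-whitespace characters across all text nodes, and to place `<w>` just below the deepest element that contains the whole run, reopening any inline elements inside it. The following is only a sketch of that idea in Python (standard library only), not a tested replacement for the stylesheets in this thread. The names `Word`, `text_runs`, `common_depth`, `open_tag` and `wrap_words` are made up for the example. It assumes no namespaces, comments, or empty elements, and it duplicates the attributes of any element it has to split.

```python
import re
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

class Word:
    """Stand-in for one <w> wrapper; each instance is a distinct word."""
    tag, attrib = "w", {}

def text_runs(el, path=()):
    """Yield (text, path) per text node in document order; path holds the
    inline elements between the root and that text."""
    if el.text:
        yield el.text, path
    for child in el:
        yield from text_runs(child, path + (child,))
        if child.tail:
            yield child.tail, path

def common_depth(paths):
    """Length of the prefix shared by all paths (compared by identity)."""
    depth = 0
    for level in zip(*paths):
        if any(e is not level[0] for e in level):
            break
        depth += 1
    return depth

def open_tag(el):
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in el.attrib.items())
    return f"<{el.tag}{attrs}>"

def wrap_words(root):
    targets, word = [], []

    def flush():
        # Put <w> just below the deepest element containing the whole word.
        if word:
            d, w = common_depth([p for _, p in word]), Word()
            targets.extend((t, p[:d] + (w,) + p[d:]) for t, p in word)
            word.clear()

    for text, path in text_runs(root):
        for m in re.finditer(r"(\s+)|\S+", text):
            if m.group(1) is None:       # non-whitespace: part of the current word
                word.append((m.group(), path))
            else:                        # whitespace ends the word and stays in place
                flush()
                targets.append((m.group(), path))
    flush()

    out, stack = [open_tag(root)], []

    def move_to(target):
        # Close and open tags until the open-element stack equals target.
        keep = common_depth([tuple(stack), target])
        while len(stack) > keep:
            out.append(f"</{stack.pop().tag}>")
        for el in target[keep:]:
            out.append(open_tag(el))
            stack.append(el)

    for text, path in targets:
        move_to(path)
        out.append(escape(text))
    move_to(())
    out.append(f"</{root.tag}>")
    return "".join(out)

print(wrap_words(ET.fromstring("<mixed>a <b><c>horrible vermin</c></b>.</mixed>")))
```

On that input the sketch yields `<w><b><c>vermin</c></b>.</w>` for the last word, so the full stop stays attached. Whitespace is left in whichever element it came from, so the space after `horrible` remains inside `<c>` rather than moving outside as in the sample output in the question. A part of a word in an inline element, such as `un<i>believ</i>able`, ends up as a single `<w>` around the whole word.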

score 0 · Answer 2 · answered Apr 01 '16 at 11:42

If you use analyze-string on \S+ with

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="mixed//text()">
        <xsl:analyze-string select="." regex="\S+">
            <xsl:matching-substring>
                <w>
                    <xsl:value-of select="."/>
                </w>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:transform>

you get

<mixed>
  <w>One</w> <w>morning,</w> <w>when</w> <a><w>Gregor</w> <w>Samsa</w></a>
  <w>woke</w> <w>from</w> <w>troubled</w> <w>dreams,</w> <w>he</w> <w>found</w>
  <w>himself</w> <w>transformed</w> <w>in</w> <w>his</w> <w>bed</w> <w>into</w>
  <w>a</w> <b><c><w>horrible</w> <w>vermin</w></c></b><w>.</w>
</mixed>

Do you really want to join the trailing dot with the preceding vermin that is inside of your inline elements?

Correct, I do want to join the trailing dot. – hftf Apr 01 '16 at 11:57
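
To see why a template that matches each text node separately produces a stand-alone `.`, here is a small Python sketch (standard library only; the shortened input is an assumption for brevity) that compares tokenizing per text node with tokenizing the concatenated string value of the element.

```python
import re
from xml.etree import ElementTree as ET

doc = ET.fromstring("<mixed>a <b><c>horrible vermin</c></b>.</mixed>")

# Per text node: "vermin" and "." live in different nodes, so they never meet.
per_node = [re.findall(r"\S+", t) for t in doc.itertext()]

# Whole string value: the inline element boundaries are invisible to the regex.
whole = re.findall(r"\S+", "".join(doc.itertext()))

print(per_node)
print(whole)
```

The per-node result ends with a separate `.` token, while the whole-string result ends with `vermin.`. Joining the dot therefore requires looking across text node boundaries, which a template matching individual text nodes does not do on its own.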

score 0 · Answer 3 · answered Apr 01 '16 at 12:49

How about this XSLT, which has an extra template to cope with elements that are immediately followed by a text node containing only a full stop?

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

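 <!-- wrap each whitespace-delimited token in <w>; tokenize() discards the whitespace itself -->
 <xsl:template match="text()">
  <xsl:for-each select="tokenize(., '[\s]')[.]">
   <w><xsl:sequence select="."/></w>
  </xsl:for-each>
 </xsl:template>

 <!-- suppress a text node that is only a full stop; the template below re-emits it -->
 <xsl:template match="text()[normalize-space() = '.']" />

 <!-- wrap a node together with the full stop that immediately follows it in a single <w> -->
 <xsl:template match="node()[following-sibling::node()[1][self::text()][normalize-space() = '.']]">
  <w>
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
     <xsl:text>.</xsl:text>
  </w>
 </xsl:template>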
</xsl:stylesheet>
