
I have an element <mixed> that contains mixed content. Is it possible to use XSLT (2.0) to wrap all “words” (delimited by the pattern \s+, for example) inside <mixed> in a <w> tag, descending into inline elements when necessary? For example, given the following input:

<mixed>
  One morning, when <a>Gregor Samsa</a>
  woke from troubled dreams, he found
  himself transformed in his bed into
  a <b><c>horrible vermin</c></b>.
</mixed>

I want something like the following output:

<mixed>
  <w>One</w> <w>morning,</w> <w>when</w> <a><w>Gregor</w> <w>Samsa</w></a>
  <w>woke</w> <w>from</w> <w>troubled</w> <w>dreams,</w> <w>he</w> <w>found</w>
  <w>himself</w> <w>transformed</w> <w>in</w> <w>his</w> <w>bed</w> <w>into</w>
  <w>a</w> <b><c><w>horrible</w></c></b> <w><b><c>vermin</c></b>.</w>
</mixed>

Dimitre Novatchev provided a template in an answer to this related question that goes much of the way to solving this, but does not satisfy the following requirements:

  • Inline elements that terminate within a “word” should be split so that a single <w> element contains the whole “word.” Otherwise there would be invalid XML, such as:

      <w>a</w> <w><b><c>horrible</w> <w>vermin</c></b>.</w>

    However, this template detaches the punctuation . after vermin and produces:

      <w>a</w> <b><c><w>horrible</w> <w>vermin</w></c></b> <w>.</w>

    (Edit: None of the current 3 answers satisfy this requirement.)

  • The split token must not be discarded. Consider the similar task of wrapping non-coefficient numbers in <sub> tags in the context of a chemical formula. For example, <reactants>2H2 + O2</reactants> becomes <reactants>2H<sub>2</sub> + O<sub>2</sub></reactants>. This is not possible using the tokenize function because it simply discards the separator. Instead we will probably have to fall back on analyze-string.
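
For the formula case, the analyze-string route can be sketched as follows. This is only a sketch under one assumption that is not stated in the question: a digit run counts as a subscript exactly when it directly follows a letter, so a leading coefficient such as the 2 in 2H2 is left untouched and parenthesised groups are not handled.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity transform: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="reactants//text()">
    <!-- group 1: the letter before the digits; group 2: the digits
         (assumption: only digits right after a letter are subscripts) -->
    <xsl:analyze-string select="." regex="([A-Za-z])([0-9]+)">
      <xsl:matching-substring>
        <!-- the matched text stays available, so nothing is lost -->
        <xsl:value-of select="regex-group(1)"/>
        <sub>
          <xsl:value-of select="regex-group(2)"/>
        </sub>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- coefficients, operators and spaces pass through unchanged -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Applied to <reactants>2H2 + O2</reactants>, the element comes out as <reactants>2H<sub>2</sub> + O<sub>2</sub></reactants>. Unlike tokenize, both the matched and the unmatched pieces are emitted, which is the property the word-wrapping task needs as well.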

If not XSLT, what is the best method to do this?

hftf
  • Hi Michael, none of the posted answers have solved my question. What do I do then? I don't really have enough rep for bounties. – hftf Apr 16 '16 at 06:44
  • IMHO, you need to better define the problem. No one can provide an algorithm to solve a problem if they don't know how to solve it manually using only paper and pencil - and a few clear rules. – michael.hor257k Apr 16 '16 at 11:36

3 Answers


AFAICT, this would provide the expected result in your example:

XSLT 2.0

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="no"/>
<xsl:strip-space elements="*"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="text()[ancestor::mixed]">
    <xsl:analyze-string select="." regex="\s+">
        <xsl:matching-substring>
            <xsl:value-of select="." />
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <w>
                <xsl:value-of select="." />
            </w>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>

However, I did not understand your point regarding "Inline elements that terminate within a “word”". What would be the expected result when, for example, a part of a word is italicized?

michael.hor257k
  • I consider `vermin.` to be a single word because it doesn’t contain the separator pattern `\s+`. However, the inline elements `<b>` and `<c>` are opened in a previous word but are closed in the middle of that word: `vermin</c></b>.` So in order to wrap only `vermin.` in one element, the inline elements `<b>` and `<c>`, which are part of several words, must be split. An example of the expected result is already in my post: `<w>a</w> <b><c><w>horrible</w></c></b> <w><b><c>vermin</c></b>.</w>`. Please let me know if this makes sense. – hftf Apr 01 '16 at 11:54
  • If I understand correctly, the suggested solution does what you want in that regard - although for an entirely different reason, namely that each text node is processed individually. – michael.hor257k Apr 01 '16 at 12:03
  • I believe that your solution wraps spaces, not words, with `<w>` because the `regex` parameter in `analyze-string` is `\s+`. Even when changed to `\S+`, the output ends with `<w>vermin</w></c></b><w>.</w>`. – hftf Apr 01 '16 at 12:09
  • "*I believe that your solution wraps spaces, not words*" Right, fixed now. -- "*the output ends with...*" We need clearer rules here. Until then, if a period is a part of a word, then a single period is a one-letter word. If you want each text node to also consider the previous/next nodes, this will get **much** more complicated. – michael.hor257k Apr 01 '16 at 12:47
  • Indeed, it's a complicated problem. – hftf Apr 01 '16 at 12:51
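
To make "consider the previous/next nodes" concrete, here is a rough sketch that only detects words broken by an element boundary; it does not merge or re-nest them. The part attribute and its values start, middle and end are made up for illustration, and the sketch assumes that <mixed> elements are not nested and that whitespace-only text nodes are kept (no xsl:strip-space), since they separate words.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="mixed//text()">
    <xsl:variable name="scope" select="ancestor::mixed[1]"/>
    <!-- nearest neighbouring text nodes, if inside the same <mixed> -->
    <xsl:variable name="prev"
        select="preceding::text()[1][ancestor::mixed[1] is $scope]"/>
    <xsl:variable name="next"
        select="following::text()[1][ancestor::mixed[1] is $scope]"/>
    <!-- true when no whitespace separates this node from its neighbour -->
    <xsl:variable name="joins-prev"
        select="matches(., '^\S') and matches($prev, '\S$')"/>
    <xsl:variable name="joins-next"
        select="matches(., '\S$') and matches($next, '^\S')"/>

    <xsl:analyze-string select="." regex="\S+">
      <xsl:matching-substring>
        <!-- position()/last() count matching and non-matching pieces -->
        <xsl:variable name="continues"
            select="$joins-prev and position() = 1"/>
        <xsl:variable name="continued"
            select="$joins-next and position() = last()"/>
        <w>
          <xsl:if test="$continues or $continued">
            <!-- hypothetical marker for a word fragment -->
            <xsl:attribute name="part"
                select="if ($continues and $continued) then 'middle'
                        else if ($continues) then 'end'
                        else 'start'"/>
          </xsl:if>
          <xsl:value-of select="."/>
        </w>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

On the sample input this should leave every ordinary word as a plain <w>, tag vermin with part="start" and the trailing full stop with part="end". Turning those fragments into a single <w> with split copies of <b> and <c> would still need a second pass, which is where the real complexity lies.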

If you use analyze-string on \S+ with

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="mixed//text()">
        <xsl:analyze-string select="." regex="\S+">
            <xsl:matching-substring>
                <w>
                    <xsl:value-of select="."/>
                </w>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:transform>

you get

<mixed>
  <w>One</w> <w>morning,</w> <w>when</w> <a><w>Gregor</w> <w>Samsa</w></a>
  <w>woke</w> <w>from</w> <w>troubled</w> <w>dreams,</w> <w>he</w> <w>found</w>
  <w>himself</w> <w>transformed</w> <w>in</w> <w>his</w> <w>bed</w> <w>into</w>
  <w>a</w> <b><c><w>horrible</w> <w>vermin</w></c></b><w>.</w>
</mixed>

Do you really want to join the trailing dot with the preceding vermin that is inside of your inline elements?

Martin Honnen

How about this XSLT, which has an extra template to cope with elements that are immediately followed by a text node containing only a full stop?

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match="text()">
  <xsl:for-each select="tokenize(., '[\s]')[.]">
   <w><xsl:sequence select="."/></w>
  </xsl:for-each>
 </xsl:template>

 <xsl:template match="text()[normalize-space() = '.']" />

 <xsl:template match="node()[following-sibling::node()[1][self::text()][normalize-space() = '.']]">
  <w>
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
     <xsl:text>.</xsl:text>
  </w>
 </xsl:template>
</xsl:stylesheet>
Tim C