All of the three different solutions below use the XSLT design pattern of overriding the identity rule to generally preserve the structure and contents of the XML document, and only modify specific nodes.
I. XSLT 1.0 solution:
This short and simple transformation (no <xsl:choose>
used anywhere):
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(self::title)]/text()"
name="split">
<xsl:param name="pText" select=
"concat(normalize-space(.), ' ')"/>
<xsl:if test="string-length(normalize-space($pText)) >0">
<span>
<xsl:value-of select=
"substring-before($pText, ' ')"/>
</span>
<xsl:call-template name="split">
<xsl:with-param name="pText"
select="substring-after($pText, ' ')"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
when applied to the provided XML document:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div> Text in a div </div>
<div>
Text in a div
<p>
Text inside a p
</p>
</div>
</body>
</html>
produces the wanted, correct result:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
</div>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
<p>
<span>Text</span>
<span>inside</span>
<span>a</span>
<span>p</span>
</p>
</div>
</body>
</html>
II. XSLT 2.0 solution:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(self::title)]/text()">
<xsl:for-each select="tokenize(., '[\s]')[.]">
<span><xsl:sequence select="."/></span>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
when this transformation is applied to the same XML document (above), again the correct, wanted result is produced:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
</div>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
<p>
<span>Text</span>
<span>inside</span>
<span>a</span>
<span>p</span>
</p>
</div>
</body>
</html>
III Solution using FXSL:
Using the str-split-to-words
template/function of FXSL one can easily implement much more complicated tokenization -- in any version of XSLT:
Let's have a more complicated XML document and tokenization rules:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div> Text: in a div </div>
<div>
Text; in; a. div
<p>
Text- inside [a] [p]
</p>
</div>
</body>
</html>
Here there is more than one delimiter that indicates the start or end of a word. In this particular example the delimiters can be: " "
, ";"
, "."
, ":"
, "-"
, "["
, "]"
.
The following transformation uses FXSL for this more complicated tokenization:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:ext="http://exslt.org/common"
exclude-result-prefixes="ext">
<xsl:import href="strSplit-to-Words.xsl"/>
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(self::title)]/text()">
<xsl:variable name="vwordNodes">
<xsl:call-template name="str-split-to-words">
<xsl:with-param name="pStr" select="normalize-space(.)"/>
<xsl:with-param name="pDelimiters"
select="' ;.:-[]'"/>
</xsl:call-template>
</xsl:variable>
<xsl:apply-templates select="ext:node-set($vwordNodes)/*"/>
</xsl:template>
<xsl:template match="word[string-length(normalize-space(.)) > 0]">
<span>
<xsl:value-of select="."/>
</span>
</xsl:template>
</xsl:stylesheet>
and produces the wanted, correct result:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
</div>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
<p>
<span>Text</span>
<span>inside</span>
<span>a</span>
<span>p</span>
<word/>
</p>
</div>
</body>
</html>