I just ran across this same issue, and after far too long researching it, here's what I've concluded.
Java XSLT processors escape multi-byte UTF-8 characters into HTML entities even if the output mode is XML... if multibyte chars occur in a text() node that's not wrapped in CDATA. If the characters are wrapped in CDATA (for output) the multibyte character will be preserved.
My Problem:
I had an xml file that looked like this, complete with emoji.
<events>
<event>
<id>RANDOMID</id>
<blah>
<blahId>FOOONE</blahId>
</blah>
<blah>
<blahId>FOOTWO</blahId>
</blah>
<eventComment>Did some things. Had some Fun. </eventComment>
</event>
</events>
I started with an XSL stylesheet that looked like this:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/xhtml1/strict"
>
<xsl:output method = "xml" version="1.0" encoding = "UTF-8" omit-xml-declaration="no" indent="yes" />
<xsl:template match="/">
<events>
<xsl:for-each select="/events/event">
<event>
<xsl:copy-of select="./*[name() != 'blah'"/>
<xsl:for-each select="./blah">
<blahId><xsl:copy-of select="./blahId/text()"/></blahId>
</xsl:for-each>
</event>
</xsl:for-each>
</events>
</xsl:template>
</xsl:stylesheet>
Running this with a java Transformer consistently produced ��
where my emoji should be. Subsequent attempts to parse the resultant Document failed with the following exception message:
org.xml.sax.SAXParseException; lineNumber: y; columnNumber: x; Character reference "�" is an invalid XML character.
HOGWASH!
Testing this with xsltproc
on the command line was useless, since xsltproc
isn't stupid when it comes to multibyte characters. I got the output I expected.
A SOLUTION
Having the XSLT wrap the eventComment
in CDATA by specifying the QName in the xsl:output
tag cdata-section-elements
attribute will preserve the bytes and works with xsltproc and the java Transformer.
The magic here is the output cdata-secion-elements
property from the <xsl:output>
tag. https://www.w3.org/TR/xslt#output
I updated my XSL template to be:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/xhtml1/strict"
>
<xsl:output cdata-section-elements="eventComment" method="xml" version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
<xsl:template match="/">
<events>
<xsl:for-each select="/events/event">
<event>
<xsl:copy-of select="./*[name() != 'blah' and name() != 'eventComment']"/>
<!-- For the cdata-section-elements to resolve that eventComment needs to be preserved as CDATA
(so we don't get java doing stupid things with unicode escapment)
it needs to be explicitly referenced here.
-->
<eventComment><xsl:copy-of select="./eventComment/text()"/></eventComment>
<xsl:for-each select="./blah">
<blahId><xsl:copy-of select="./blahId/text()"/></blahId>
</xsl:for-each>
</event>
</xsl:for-each>
</events>
</xsl:template>
</xsl:stylesheet>
And now my output from both xsltproc
and a java Transformer looks like this, and parses happily with java DocumentBuilders.
<?xml version="1.0" encoding="UTF-8"?>
<events xmlns="http://www.w3.org/TR/xhtml1/strict">
<event>
<id xmlns="">RANDOMID</id>
<eventComment><![CDATA[Did some things. Had some Fun. ]]></eventComment>
<blahId>FOO</blahId>
<blahId>FOOTOO</blahId>
</event>
</events>