My XSLT transformations had been working for months until I ran across an XML file with Unicode characters (most likely emoji). I need to preserve the Unicode, but XSLT is converting it to HTML entities. I thought that setting the encoding to UTF-8 would solve my problem, but I'm still having issues.

Any help appreciated. Code:

private byte[] transform(InputStream stream) throws Exception{
    System.setProperty("javax.xml.transform.TransformerFactory", "org.apache.xalan.processor.TransformerFactoryImpl"); 

    Transformer xmlTransformer;

    xmlTransformer = (TransformerImpl) TransformerFactory.newInstance().newTransformer(new StreamSource(createXsltStylesheet()));
    xmlTransformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

    XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(stream,"UTF-8");
    Source staxSource = new StAXSource(reader, true); 
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    Writer writer = new OutputStreamWriter(outputStream, "UTF-8");
    xmlTransformer.transform(staxSource, new StreamResult(writer));


    return outputStream.toByteArray();
}

If I add

xmlTransformer.setOutputProperty(OutputKeys.METHOD, "text");

the Unicode is preserved but the XML markup is not.

l15a
  • Similar (but unfortunately also unanswered): http://stackoverflow.com/questions/15592025/transformer-setoutputpropertyoutputkeys-encoding-utf-8-is-not-working. This one is looking better: http://stackoverflow.com/questions/443305/producing-valid-xml-with-java-and-utf-8-encoding – Tomalak Aug 07 '13 at 06:03
  • Xalan encodes emojis properly using UTF-16, rather than UTF-8. Try: `transformer.setOutputProperty(ENCODING, UTF_16.toString());` – Dave Jarvis Apr 01 '22 at 18:10

4 Answers

I just ran across this same issue, and after far too long researching it, here's what I've concluded.

Java XSLT processors escape multi-byte UTF-8 characters (in my case emoji, which lie outside the Basic Multilingual Plane) into numeric character references even if the output method is XML... provided the characters occur in a text() node that's not wrapped in CDATA. If the characters are wrapped in CDATA (for output), they are preserved.

My Problem:

I had an xml file that looked like this, complete with emoji.

<events>
    <event>
       <id>RANDOMID</id>
       <blah>
          <blahId>FOOONE</blahId>
       </blah>
       <blah>
          <blahId>FOOTWO</blahId>
       </blah>
       <eventComment>Did some things. Had some Fun. 👍</eventComment>
    </event>
</events>

I started with an XSL stylesheet that looked like this:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/TR/xhtml1/strict"
>
    <xsl:output method = "xml" version="1.0" encoding = "UTF-8" omit-xml-declaration="no" indent="yes" />

    <xsl:template match="/">
        <events>
            <xsl:for-each select="/events/event">
                <event>
                    <xsl:copy-of select="./*[name() != 'blah']"/>
                    <xsl:for-each select="./blah">
                        <blahId><xsl:copy-of select="./blahId/text()"/></blahId>
                    </xsl:for-each>
                </event>
            </xsl:for-each>
        </events>
    </xsl:template>
</xsl:stylesheet>

Running this with a Java Transformer consistently produced &#55357;&#56397; where my emoji should be. Subsequent attempts to parse the resulting document failed with the following exception message:

org.xml.sax.SAXParseException; lineNumber: y; columnNumber: x; Character reference "&#55357" is an invalid XML character.

HOGWASH!

Testing this with xsltproc on the command line was useless, since xsltproc isn't stupid when it comes to multibyte characters. I got the output I expected.
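
For reference, the two numbers in that output are not arbitrary: they are the two UTF-16 surrogate halves of a single code point, each written as its own character reference, which is why the parser rejects them. The sketch below (the class and method names are made up for illustration) uses the standard `Character.toCodePoint` to show which code point the pair stands for and what a single, valid reference would look like.

```java
class SurrogatePairDemo {

    /** Combines two UTF-16 surrogate halves into the code point they encode. */
    static int combine(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        // The two values from the broken output: &#55357;&#56397;
        int codePoint = combine(55357, 56397);

        // Neither half is a legal XML character on its own.
        System.out.println(Character.isHighSurrogate((char) 55357)); // true
        System.out.println(Character.isLowSurrogate((char) 56397));  // true

        // Together they encode one code point outside the BMP.
        System.out.println("U+" + Integer.toHexString(codePoint).toUpperCase()); // U+1F44D

        // A well-formed numeric reference uses the code point, not the halves.
        System.out.println("&#" + codePoint + ";"); // &#128077;
    }
}
```

So a serializer that wanted to escape this character should have written one reference for the whole code point instead of one per surrogate.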

A SOLUTION

Having the XSLT wrap eventComment in CDATA, by listing its QName in the cdata-section-elements attribute of the xsl:output tag, preserves the bytes and works with both xsltproc and the Java Transformer.

The magic here is the cdata-section-elements attribute of the <xsl:output> tag. https://www.w3.org/TR/xslt#output

I updated my XSL template to be:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/TR/xhtml1/strict"
>
    <xsl:output  cdata-section-elements="eventComment" method="xml" version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>

    <xsl:template match="/">
        <events>
            <xsl:for-each select="/events/event">
                <event>
                    <xsl:copy-of select="./*[name() != 'blah' and name() != 'eventComment']"/>
                    <!-- For cdata-section-elements to apply, so that eventComment is written as CDATA
                        (and we don't get Java doing stupid things with Unicode escaping),
                         eventComment needs to be explicitly referenced here.
                    -->
                    <eventComment><xsl:copy-of select="./eventComment/text()"/></eventComment>
                    <xsl:for-each select="./blah">
                        <blahId><xsl:copy-of select="./blahId/text()"/></blahId>
                    </xsl:for-each>
                </event>
            </xsl:for-each>
        </events>
    </xsl:template>
</xsl:stylesheet>

And now my output from both xsltproc and a java Transformer looks like this, and parses happily with java DocumentBuilders.

<?xml version="1.0" encoding="UTF-8"?>
<events xmlns="http://www.w3.org/TR/xhtml1/strict">
  <event>
    <id xmlns="">RANDOMID</id>
    <eventComment><![CDATA[Did some things. Had some Fun. 👍]]></eventComment>
    <blahId>FOOONE</blahId>
    <blahId>FOOTWO</blahId>
  </event>
</events>
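
The same setting can also be applied from the Java side rather than in the stylesheet, through the standard `OutputKeys.CDATA_SECTION_ELEMENTS` output property. The following is only a minimal sketch under some assumptions: it uses an identity transformer from whatever `TransformerFactory` is on the classpath instead of the stylesheet above, the class and method names are invented for the example, and the input has no namespaces (a namespaced element would have to be listed as `{uri}localName`). Whether the emoji itself survives still depends on the processor and its version, so check the output with your own setup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class CdataOutputDemo {

    /** Re-serializes the XML, writing the text of eventComment as CDATA. */
    static String serialize(String xml) {
        try {
            // No stylesheet argument: this is an identity (copy) transformer.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            // Java-side equivalent of cdata-section-elements="eventComment".
            transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "eventComment");

            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // \uD83D\uDC4D is the surrogate pair for a single emoji code point.
        String xml = "<event><eventComment>Had some Fun. \uD83D\uDC4D</eventComment></event>";
        System.out.println(serialize(xml));
    }
}
```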
bvarner

This line is suspicious:

stream = IOUtils.toInputStream(outputStream.toString(),"UTF-8");

You are converting a ByteArrayOutputStream to a String using the default encoding of your platform, which is probably not UTF-8. Change it to

stream = IOUtils.toInputStream(outputStream.toString("UTF-8"),"UTF-8");

or, for better performance, just wrap the byte array in a ByteArrayInputStream:

return new ByteArrayInputStream(outputStream.toByteArray());
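
A small sketch of the point about charsets, with a made-up class name and sample string: the bytes only turn back into the original text when they are decoded with the charset that wrote them. It uses `new String(bytes, StandardCharsets.UTF_8)`, which has the same effect as `outputStream.toString("UTF-8")` without the checked exception.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {

    /** Writes the text as UTF-8 bytes, the way a UTF-8 transformer output would. */
    static ByteArrayOutputStream writeUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(bytes, 0, bytes.length);
        return out;
    }

    /** Decodes with an explicit charset instead of the platform default. */
    static String decodeUtf8(ByteArrayOutputStream out) {
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Skips the String step entirely and hands the raw bytes on. */
    static InputStream asStream(ByteArrayOutputStream out) {
        return new ByteArrayInputStream(out.toByteArray());
    }

    public static void main(String[] args) throws Exception {
        String original = "Had some Fun. \uD83D\uDC4D";
        ByteArrayOutputStream out = writeUtf8(original);

        System.out.println(decodeUtf8(out).equals(original)); // true
        // 14 single-byte characters plus one 4-byte emoji.
        System.out.println(asStream(out).available()); // 18
    }
}
```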
forty-two
  • thanks for the comment. That line is actually after the problem. The emoji is changed when I call the transformer. I've updated my code to reflect my latest changes. – l15a Aug 07 '13 at 16:24

Try converting the XML to a String using the Apache (Xerces) serializer.

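import java.io.StringWriter;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;

// Serialize the DOM Document `doc` to a String
OutputFormat format = new OutputFormat(doc);
StringWriter stringOut = new StringWriter();
XMLSerializer serial = new XMLSerializer(stringOut, format);
serial.serialize(doc);
// Display the XML
System.out.println(stringOut.toString());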
benka

I just solved a similar problem by adding the line below to the original XML document:

document.appendChild(document.createProcessingInstruction(StreamResult.PI_DISABLE_OUTPUT_ESCAPING, ""));

Refer to: Writing emoji to XML file in JAVA

Perhaps a similar setting can be used for the transformer.

Léo Germond