2

I have an org.w3c.dom.Document and want to serialize it with this function, but I get a SAXException. How can I fix this?

public static String serializeXmlDocument(Document document) throws Exception
{
    // set up a transformer
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer trans = transformerFactory.newTransformer();
    trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    trans.setOutputProperty(OutputKeys.INDENT, "yes");
    DOMSource source = new DOMSource(document);

    // create string from xml tree
    StringWriter stringWriter = new StringWriter();
    StreamResult stringResult = new StreamResult(stringWriter);
    trans.transform(source, stringResult);

    return stringWriter.toString();
}

This results in the following error:

2014-07-20 03:03:36,451 ERROR  [XXX] XXX main job error:  
javax.xml.transform.TransformerException: org.xml.sax.SAXException: E/A-Fehler
java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:758)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:359)
    at mypackage.handler.XmlHandler.serializeXmlDocument(XmlHandler.java:226)
    at mypackage.subpackage.buildSolrXml(MyJob.java:213)
    at mypackage.subpackage.doJob(MyJob.java:113)
    at mypackage.MyWorkstation.main(MyWorkstation.java:27)
Caused by: org.xml.sax.SAXException: E/A-Fehler
java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
    at com.sun.org.apache.xml.internal.serializer.ToStream.cdata(ToStream.java:1290)
    at com.sun.org.apache.xml.internal.serializer.ToStream.characters(ToStream.java:1395)
    at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.characters(ToUnknownStream.java:814)
    at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.characters(ToUnknownStream.java:348)
    at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:122)
    at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
    at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
    at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
    at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:136)
    at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:98)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:702)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:746)
    ... 5 more
Caused by: java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
    at com.sun.org.apache.xml.internal.serializer.ToStream.writeUTF16Surrogate(ToStream.java:973)
    at com.sun.org.apache.xml.internal.serializer.ToStream.writeNormalizedChars(ToStream.java:1110)
    at com.sun.org.apache.xml.internal.serializer.ToStream.cdata(ToStream.java:1267)
    ... 16 more

(Translated from German, the messages read "I/O error" and "Invalid UTF-16 surrogate detected: d835 20 ?".)
Parker
wutzebaer

2 Answers

1

This is not always caused by invalid UTF-16 characters. If a multi-byte character (one that Java represents as a UTF-16 surrogate pair) straddles a 1024-byte boundary anywhere in the stream, the Xalan XSLTC processor splits it into two pieces. Each piece is an invalid character on its own, which in most cases produces the error above.

This is due to a Xalan bug (1024-byte buffers), which will be fixed in OpenJDK 12.
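
To illustrate the failure mode described above, here is a minimal sketch of how cutting a Java string at a fixed offset can separate the two halves of a surrogate pair. It is not Xalan's actual code; the class name `SurrogateSplitDemo` and the helper `splitsSurrogatePair` are hypothetical, and U+1D54F is chosen only because its high surrogate is the `d835` from the error message.

```java
final class SurrogateSplitDemo {

    /** Returns true if cutting text at index would separate a high surrogate from its low surrogate. */
    public static boolean splitsSurrogatePair(String text, int index) {
        return index > 0
                && index < text.length()
                && Character.isHighSurrogate(text.charAt(index - 1))
                && Character.isLowSurrogate(text.charAt(index));
    }

    public static void main(String[] args) {
        // U+1D54F lies outside the Basic Multilingual Plane, so a Java String
        // stores it as two chars: the surrogate pair D835 DD4F.
        String text = "x" + new String(Character.toChars(0x1D54F)) + "y";

        // Pretend a buffer boundary falls at each possible offset.
        for (int cut = 1; cut < text.length(); cut++) {
            System.out.printf("cut at %d splits a surrogate pair: %b%n",
                    cut, splitsSurrogatePair(text, cut));
        }
    }
}
```

A serializer that processes its input in fixed-size chunks would need a check of this kind at each boundary and would have to carry the high surrogate over into the next chunk.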

The simplest file that triggers this bug is:

<?xml version="1.0" ?><x>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</x>

Update (April 9, 2021): It looks like this was "fixed" in Java 8 (8u222 or 8u251) and in 11.0.7. However, although the error no longer occurs, the character in question appears to be ignored by the parser.

Parker
-1

The Document contained invalid Unicode characters, such as the unpaired surrogate U+D835:

http://www.fileformat.info/info/unicode/char/d835/index.htm

I fixed it with the solution from "Removing invalid XML characters from a string in Java":

// replace every character that is not allowed in XML 1.0 with a space
String xml10pattern = "[^"
        + "\u0009\r\n"
        + "\u0020-\uD7FF"
        + "\uE000-\uFFFD"
        + "\ud800\udc00-\udbff\udfff"
        + "]";

stringValue = stringValue.replaceAll(xml10pattern, " ");
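
For reference, here is a self-contained sketch of the same idea. The class name `XmlSanitizer` and the method `stripInvalidXmlChars` are hypothetical. The pattern covers the same XML 1.0 character ranges as the snippet above, but it is written with `\x{...}` regex escapes instead of literal characters and compiled once.

```java
import java.util.regex.Pattern;

final class XmlSanitizer {

    // Matches anything outside the XML 1.0 character ranges:
    // #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\x{9}\\r\\n\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    /** Replaces every character that XML 1.0 does not allow with a single space. */
    public static String stripInvalidXmlChars(String value) {
        return INVALID_XML10.matcher(value).replaceAll(" ");
    }

    public static void main(String[] args) {
        // A lone high surrogate followed by a space, as in the error message (d835 20).
        String broken = "a\uD835 b";
        System.out.println("[" + stripInvalidXmlChars(broken) + "]");

        // A complete surrogate pair (U+1D54F) is a valid XML character and is kept.
        String valid = "a" + new String(Character.toChars(0x1D54F)) + "b";
        System.out.println(stripInvalidXmlChars(valid).equals(valid));
    }
}
```

The sanitizer has to run on the text values before they are put into the Document (or on each text node before serializing). Applying it to the serialized string is too late, because the transformer fails first.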
Community
wutzebaer