I have an XML file in which some symbols were encoded incorrectly because UTF-16 and UTF-8 were mixed.

For example, the symbol 📞 is encoded as &#55357;&#56542; (a UTF-16 surrogate pair written as two character references) instead of &#128222;.
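
To make the relationship between the two encodings concrete, here is a small sketch of the arithmetic. The numbers 55357 and 56542 are taken from the test string in this question; the class and method names (`SurrogatePairDemo`, `combine`) are made up for illustration only.

```java
class SurrogatePairDemo {

    // Combines the numeric values of two character references that together
    // form a UTF-16 surrogate pair into the single code point they stand for.
    static int combine(int high, int low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a high/low surrogate pair");
        }
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        int codePoint = combine(55357, 56542);   // 0xD83D, 0xDCDE
        System.out.println(codePoint);                       // 128222
        System.out.println(Integer.toHexString(codePoint));  // 1f4de
    }
}
```

So the well-formed reference for this symbol is the single `&#128222;`, which is what the rest of the test string already uses.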

I want to unmarshal this XML file, but unmarshalling fails when the Unmarshaller meets these incorrect symbols. If I decode only those symbols with StringEscapeUtils#unescapeHtml4 (or StringEscapeUtils#unescapeXml), everything works correctly.

But I don't want to read the XML into a string, decode it, and only then unmarshal it.

How can I do the same thing inside the unmarshalling process, without first reading the XML file into a string?

I created a simple test to reproduce this:

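// Imports assumed for this test: JAXB under javax.xml.bind, JUnit 4, and
// StringEscapeUtils from Apache Commons Text (commons-lang3 has the same method).
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import org.apache.commons.text.StringEscapeUtils;
import org.junit.Test;

public class XmlReaderTest {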

    private static final Pattern HTML_UNICODE_REGEX = Pattern.compile("&#[a-zA-Z0-9]+;&#[a-zA-Z0-9]+;");

    @Test
    public void test() throws Exception {
        final Unmarshaller unmarshaller = JAXBContext.newInstance(Value.class).createUnmarshaller();
        final XMLInputFactory factory = createXmlInputFactory();

        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><value><name>&#55357;&#56542; &amp; &#128222; O&#771;</name></value>";

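        // Works: the surrogate-pair references are decoded in the string before parsing.
        XMLEventReader xmlReader = factory.createXMLEventReader(new StringReader(decodeHtmlEntities(xml)));
        Value result = (Value)unmarshaller.unmarshal(xmlReader);
        // O&#771; is 'O' followed by the combining tilde U+0303, written out explicitly here.
        assert result.name.equals("\uD83D\uDCDE & \uD83D\uDCDE O\u0303");

        // Fails: &#55357; and &#56542; are surrogate references, which XML does not allow.
        XMLEventReader xmlReader2 = factory.createXMLEventReader(new StringReader(xml));
        Value result2 = (Value)unmarshaller.unmarshal(xmlReader2); // throws an exception here
        assert result2.name.equals("\uD83D\uDCDE & \uD83D\uDCDE O\u0303");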
    }

    @XmlRootElement(name = "value")
    private static class Value {
        @XmlElement
        public String name;
    }

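    // Decodes only runs of two adjacent character references (the surrogate pairs).
    private String decodeHtmlEntities(String readerString) {
        // StringBuffer because Matcher#appendReplacement only accepts StringBuilder from Java 9 on.
        StringBuffer unescapedString = new StringBuffer();

        Matcher regexMatcher = HTML_UNICODE_REGEX.matcher(readerString);
        while (regexMatcher.find()) {
            // quoteReplacement keeps a decoded '$' or '\' from being read as a group reference.
            String decoded = StringEscapeUtils.unescapeHtml4(regexMatcher.group());
            regexMatcher.appendReplacement(unescapedString, Matcher.quoteReplacement(decoded));
        }
        regexMatcher.appendTail(unescapedString);

        return unescapedString.toString();
    }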

    private XMLInputFactory createXmlInputFactory() {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        return factory;
    }
}
Daria Pydorenko
  • Does this answer your question? [What is a "surrogate pair" in Java?](https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java) – JosefZ Apr 21 '21 at 10:00
  • @JosefZ I know what a `surrogate pair` is. I want to decode it correctly during unmarshalling. The problem is that not all symbols are represented in the same way. – Daria Pydorenko Apr 21 '21 at 10:17
  • The XML spec explicitly excludes the Unicode surrogate characters from XML documents: [any Unicode character, **excluding the surrogate blocks**, FFFE, and FFFF.](https://www.w3.org/TR/2006/REC-xml11-20060816/) – JosefZ Apr 21 '21 at 10:33
  • To reinforce what JosefZ says: this means that the XML that's encoded as such is explicitly malformed: it's not valid XML, declining to process it is the **correct** course of action. – Joachim Sauer Apr 21 '21 at 10:40
  • Thanks! So, is rejecting such XML the only correct approach? – Daria Pydorenko Apr 21 '21 at 10:46
  • I'd replace all unwanted/illicit characters (surrogate pairs) using some _plain text_ processing (outside native XML tools). – JosefZ Apr 21 '21 at 11:53
  • @JosefZ While you have identified the problem, I'm not sure that the proposed duplicate is appropriate, since that question does not even mention XML, yet XML is central to this question. Wouldn't [this question](https://stackoverflow.com/q/23239432/2985643) be a better candidate, especially given [this answer](https://stackoverflow.com/a/23243520/2985643)? – skomisa Apr 22 '21 at 19:49
  • This answer solved my problem: https://stackoverflow.com/a/28650366/8253837. It doesn't really answer my question, but it solves my real issue. – Daria Pydorenko Apr 23 '21 at 15:22
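
For reference, the plain-text preprocessing suggested in the comments does not have to read the whole file into a string. Below is a rough sketch of one way it might be done as a streaming `Reader` wrapper that merges each pair of surrogate character references into a single valid reference before the XML parser sees it. This is not the solution from the linked answer; the class name `SurrogateRefFixingReader` and its helpers are hypothetical, and it relies only on `java.io` classes. One known limitation of the sketch: it works on raw text, so it would also rewrite such pairs inside CDATA sections or comments, where they are literal text.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: rewrites "&#HIGH;&#LOW;" surrogate references to "&#CODEPOINT;".
class SurrogateRefFixingReader extends Reader {
    private static final int MAX_REF = 12;            // longest token we look at
    private final PushbackReader in;
    private final StringBuilder pending = new StringBuilder(); // output not yet handed out
    private int pos = 0;

    SurrogateRefFixingReader(Reader source) {
        this.in = new PushbackReader(source, 16);     // room for one token plus one char
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            if (pos == pending.length()) {
                pending.setLength(0);
                pos = 0;
                if (!fill()) break;                   // end of input
            }
            buf[off + n++] = pending.charAt(pos++);
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Puts the next character or (possibly merged) reference into 'pending'.
    private boolean fill() throws IOException {
        int c = in.read();
        if (c < 0) return false;
        if (c != '&') { pending.append((char) c); return true; }
        String first = readRef();
        int hi = value(first);
        if (hi >= 0xD800 && hi <= 0xDBFF) {           // high surrogate: look at what follows
            int c2 = in.read();
            if (c2 == '&') {
                String second = readRef();
                int lo = value(second);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    int cp = Character.toCodePoint((char) hi, (char) lo);
                    pending.append("&#").append(cp).append(';');
                    return true;
                }
                in.unread(second.toCharArray());      // not a pair: process it again later
            } else if (c2 >= 0) {
                in.unread(c2);
            }
        }
        pending.append(first);                        // anything else passes through unchanged
        return true;
    }

    // Reads the rest of a token after '&', up to and including ';'.
    private String readRef() throws IOException {
        StringBuilder sb = new StringBuilder("&");
        while (sb.length() < MAX_REF) {
            int c = in.read();
            if (c < 0) break;
            if (c == ';') { sb.append(';'); break; }
            if (c == '#' || Character.isLetterOrDigit(c)) sb.append((char) c);
            else { in.unread(c); break; }
        }
        return sb.toString();
    }

    // Numeric value of "&#123;" or "&#x7B;", or -1 for anything else.
    private static int value(String ref) {
        if (ref.length() < 4 || !ref.startsWith("&#") || !ref.endsWith(";")) return -1;
        boolean hex = ref.charAt(2) == 'x';
        try {
            return Integer.parseInt(ref.substring(hex ? 3 : 2, ref.length() - 1), hex ? 16 : 10);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    @Override
    public void close() throws IOException { in.close(); }

    // Demo-only convenience: runs a string through the reader.
    static String fix(String xml) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new SurrogateRefFixingReader(new StringReader(xml))) {
            for (int c; (c = r.read()) >= 0; ) out.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("<name>&#55357;&#56542; &amp; &#128222; O&#771;</name>"));
        // prints: <name>&#128222; &amp; &#128222; O&#771;</name>
    }
}
```

In the test from the question, such a reader would presumably be wrapped around the file's reader and passed to `factory.createXMLEventReader(...)` in place of the `StringReader`, so the unmarshaller only ever sees well-formed references.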

0 Answers