I have an XML file in which some symbols were encoded incorrectly because UTF-16 and UTF-8 were mixed up.
For example, the 📞 symbol is encoded as &#55357;&#56542; (&#xD83D;&#xDCDE;) instead of &#128222; (&#x1F4DE;).
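The relationship between the two forms is ordinary UTF-16 surrogate arithmetic. A small sketch (the class and method names are made up for illustration) that combines the pair \uD83D\uDCDE expected by the test below into a single code point:

```java
public class SurrogatePairDemo {
    // Combines a UTF-16 high/low surrogate pair into one Unicode code point.
    static int combine(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        int codePoint = combine(0xD83D, 0xDCDE);              // 55357 and 56542 in decimal
        System.out.println(codePoint);                        // prints 128222
        System.out.println(Integer.toHexString(codePoint));   // prints 1f4de
        // The code point maps back to the same two UTF-16 chars.
        System.out.println(new String(Character.toChars(codePoint)).equals("\uD83D\uDCDE")); // prints true
    }
}
```

So the file contains each half of the pair as its own numeric reference, where a single reference to the combined code point was needed.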
I want to unmarshal this XML file, but unmarshalling fails when the Unmarshaller meets these invalid references. If I decode only them with StringEscapeUtils#unescapeHtml4 (or StringEscapeUtils#unescapeXml), everything works correctly.
But I don't want to read the XML into a string, decode it, and only then unmarshal it.
How can I do the same thing inside the unmarshalling process, without reading the whole XML file into a string first?
I created a simple test to reproduce this:
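import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
// Assumed dependencies: commons-text (commons-lang3 has the same method) and JUnit 4.
import org.apache.commons.text.StringEscapeUtils;
import org.junit.Test;

public class XmlReaderTest {
    // Two adjacent numeric character references, e.g. a split surrogate pair.
    private static final Pattern HTML_UNICODE_REGEX = Pattern.compile("&#[a-zA-Z0-9]+;&#[a-zA-Z0-9]+;");

    @Test
    public void test() throws Exception {
        final Unmarshaller unmarshaller = JAXBContext.newInstance(Value.class).createUnmarshaller();
        final XMLInputFactory factory = createXmlInputFactory();
        // First symbol: broken surrogate-pair references; second symbol: the correct single reference.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><value><name>&#55357;&#56542; &amp; &#128222; Õ</name></value>";

        // Works: the broken pair is decoded before parsing.
        XMLEventReader xmlReader = factory.createXMLEventReader(new StringReader(decodeHtmlEntities(xml)));
        Value result = (Value) unmarshaller.unmarshal(xmlReader);
        assert result.name.equals("\uD83D\uDCDE & \uD83D\uDCDE Õ");

        // Fails: surrogate code points are not legal XML character references.
        XMLEventReader xmlReader2 = factory.createXMLEventReader(new StringReader(xml));
        Value result2 = (Value) unmarshaller.unmarshal(xmlReader2); // ! exception
        assert result2.name.equals("\uD83D\uDCDE & \uD83D\uDCDE Õ");
    }

    @XmlRootElement(name = "value")
    private static class Value {
        @XmlElement
        public String name;
    }

    // Decodes only the paired references and leaves the rest of the XML untouched.
    private String decodeHtmlEntities(String readerString) {
        StringBuffer unescapedString = new StringBuffer();
        Matcher regexMatcher = HTML_UNICODE_REGEX.matcher(readerString);
        while (regexMatcher.find()) {
            // quoteReplacement keeps '$' and '\' in the decoded text from being treated as group references.
            String decoded = StringEscapeUtils.unescapeHtml4(regexMatcher.group());
            regexMatcher.appendReplacement(unescapedString, Matcher.quoteReplacement(decoded));
        }
        regexMatcher.appendTail(unescapedString);
        return unescapedString.toString();
    }

    private XMLInputFactory createXmlInputFactory() {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        return factory;
    }
}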
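To make the streaming requirement concrete, here is a rough sketch of the kind of thing being asked for: a Reader wrapper that merges a high/low surrogate reference pair into one reference while the parser pulls characters, so the file is never held as a single string. SurrogateFixingReader and its fix helper are hypothetical names, not part of JAXB, StAX, or Commons, and this has not been verified against a real parser or large files.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

public class SurrogateFixingReader extends Reader {
    private static final int MAX_REF = 12;        // longest reference we try to parse
    private final PushbackReader in;
    private String pending = "";                  // text already decided, not yet handed out
    private int pos = 0;

    public SurrogateFixingReader(Reader source) {
        this.in = new PushbackReader(source, MAX_REF + 4);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = next();
            if (c < 0) break;
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Returns the next output char, merging "&#hi;&#lo;" into one reference.
    private int next() throws IOException {
        if (pos < pending.length()) return pending.charAt(pos++);
        int c = in.read();
        if (c != '&') return c;
        String first = readRef();
        pending = first;
        int hi = parse(first);
        if (hi >= 0xD800 && hi <= 0xDBFF) {       // high surrogate: look at what follows
            int c2 = in.read();
            if (c2 == '&') {
                String second = readRef();
                int lo = parse(second);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    pending = "&#" + Character.toCodePoint((char) hi, (char) lo) + ";";
                } else {
                    in.unread(second.toCharArray()); // not a pair: rescan it later
                }
            } else if (c2 >= 0) {
                in.unread(c2);
            }
        }
        pos = 1;
        return pending.charAt(0);
    }

    // Reads the rest of a reference after '&', stopping at ';' or any char that cannot belong to one.
    private String readRef() throws IOException {
        StringBuilder sb = new StringBuilder("&");
        while (sb.length() < MAX_REF) {
            int c = in.read();
            if (c < 0) break;
            boolean refChar = c == '#' || c == ';' || (c >= '0' && c <= '9')
                    || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (!refChar) { in.unread(c); break; }
            sb.append((char) c);
            if (c == ';') break;
        }
        return sb.toString();
    }

    // Parses "&#123;" or "&#x7B;" into its numeric value; returns -1 for anything else.
    private static int parse(String ref) {
        if (ref.length() < 4 || !ref.startsWith("&#") || !ref.endsWith(";")) return -1;
        String body = ref.substring(2, ref.length() - 1);
        try {
            return body.startsWith("x") ? Integer.parseInt(body.substring(1), 16) : Integer.parseInt(body);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Convenience helper for trying the wrapper on a small string.
    static String fix(String xml) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader r = new SurrogateFixingReader(new StringReader(xml))) {
            for (int c = r.read(); c >= 0; c = r.read()) out.append((char) c);
        }
        return out.toString();
    }
}
```

In the test above this would presumably be plugged in as factory.createXMLEventReader(new SurrogateFixingReader(fileReader)), where fileReader is a buffered Reader over the file. Two caveats with this sketch: it also rewrites matching text inside CDATA sections and comments, and going through a Reader means the caller picks the charset instead of the parser detecting it from the XML declaration.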