Java Unescaping XML/HTML before JAXB parsing doesn't work

Question

Can anyone help me?

In HTML/XML:
A numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format:

&#nnnn; or &#xhhhh;

I have to unescape these references (convert them to Unicode characters) before I use the JAXB parser.

When I use Apache StringEscapeUtils.unescapeXml(), &amp;, &gt; and &lt; are unescaped as well, and that is not what I want because parsing will then fail.

Is there a library that converts only the &#nnnn; references to Unicode and leaves the other entities escaped?

Example:
begin-tag Adam &lt ;&gt ; Sl.meer 4 & 5 &# 55357;&# 56900; end-tag

I have added spaces after &# (and before some semicolons) so that the notation stays visible instead of being rendered.

For now I fixed it like this, but I want to use a better solution.

// Unescape everything, then re-escape the characters that would break parsing.
String unescapedString = StringEscapeUtils.unescapeXml(xmlData).replaceAll("&", "&amp;")
                .replaceAll("<>", "&lt;&gt;");
// Drop code points that are not valid XML characters (isValidXMLChar is a helper defined elsewhere).
StringReader reader = new StringReader(unescapedString.codePoints().filter(c -> isValidXMLChar(c))
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append).toString());
return (Xxxx) createUnmarshaller().unmarshal(reader);

I looked through the Apache Commons Text library and finally found the solution:

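import org.apache.commons.text.translate.NumericEntityUnescaper;

// Translate only numeric references that end with a semicolon; named entities are not touched.
NumericEntityUnescaper numericEntityUnescaper = new NumericEntityUnescaper(
                    NumericEntityUnescaper.OPTION.semiColonRequired);
xmlData = numericEntityUnescaper.translate(xmlData);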
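
For readers who cannot add Commons Text as a dependency, the same idea can be sketched with the JDK alone. This is only a minimal sketch and not the library's implementation: the class name NumericRefDecoder and the method decode are hypothetical, and leaving references to &, <, >, " and ' untouched is an assumption made here so that the decoded text stays parseable.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes only numeric character references.
final class NumericRefDecoder {

    // Matches &#nnnn; (decimal) or &#xhhhh; (hex). The semicolon is required.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:([0-9]{1,7})|[xX]([0-9a-fA-F]{1,6}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 10)
                    : Integer.parseInt(m.group(2), 16);
            if (cp > Character.MAX_CODE_POINT || isMarkup(cp)) {
                // Assumption: keep out-of-range values and markup characters escaped.
                out.append(m.group());
            } else {
                // Surrogate halves such as 55357 and 56900 are appended as single
                // chars, so two adjacent references form one surrogate pair.
                out.appendCodePoint(cp);
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    // Characters that would change the XML structure if written literally.
    private static boolean isMarkup(int cp) {
        return cp == '&' || cp == '<' || cp == '>' || cp == '"' || cp == '\'';
    }
}
```

Applied to the example above (without the extra spaces), decode leaves &lt;, &gt; and &amp; escaped and turns &#55357;&#56900; into a single surrogate pair. It does not filter characters that are invalid in XML, so a check such as isValidXMLChar would still be needed afterwards.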

Could you provide a sample of the XML you are trying to unmarshal? — martidis, Feb 05 '18 at 09:07
Have you seen https://stackoverflow.com/questions/4435934 ? Why does [this approach](https://stackoverflow.com/questions/4435934) not work for you? Did you set `marshaller.setProperty("jaxb.encoding", "Unicode");` ? — jschnasse, Feb 05 '18 at 09:45
Caused by: javax.xml.bind.PropertyException: name: jaxb.encoding value: Unicode — Hans Schreuder, Feb 05 '18 at 13:23
Unmarshalling doesn't recognize that option. And nnnnn or hhhh isn't Unicode notation but HTML/XML-escaped smileys etc. — Hans Schreuder, Feb 05 '18 at 13:25
