I get arbitrary XML from a server and parse it using this Java code:
String xmlStr; // arbitrary XML input
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try {
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource is = new InputSource(new StringReader(xmlStr));
    return builder.parse(is);
}
catch (SAXException | IOException | ParserConfigurationException e) {
    LOGGER.error("Failed to parse XML.", e);
}
Every once in a while, the XML input contains an unknown entity reference such as &nbsp;
and parsing fails with an error such as org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.
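For reference, here is a minimal, self-contained reproduction of the failure. The class name UndeclaredEntityDemo and the helper tryParse are made up for illustration; only the five predefined XML entities (such as &amp;) parse without a declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class UndeclaredEntityDemo {

    // Hypothetical helper: returns "ok" if the input parses,
    // otherwise the parser's error message.
    static String tryParse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException e) {
            // Well-formedness errors, including undeclared entities, end up here
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // &amp; is predefined by XML, so this parses
        System.out.println(tryParse("<p>a&amp;b</p>"));
        // &nbsp; is not declared anywhere, so this fails
        System.out.println(tryParse("<p>a&nbsp;b</p>"));
    }
}
```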
I could solve this problem by preprocessing the original xmlStr
and translating all problematic entity references before parsing. Here's a dummy implementation that works:
protected static String translateEntityReferences(String xml) {
    String newXml = xml;
    // Maps each entity reference to the character it stands for
    Map<String, String> entityRefs = new HashMap<>();
    entityRefs.put("&nbsp;", "\u00A0");
    entityRefs.put("&laquo;", "«");
    entityRefs.put("&raquo;", "»");
    // ... and 250 more...
    for (Entry<String, String> er : entityRefs.entrySet()) {
        newXml = newXml.replace(er.getKey(), er.getValue());
    }
    return newXml;
}
However, this is really unsatisfactory, because there are a huge number of entity references, and I don't want to hard-code them all into my Java class.
Is there any easy way of teaching this entire list of character entity references to the DocumentBuilder?
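One direction that might work, given here only as a sketch and not a confirmed solution: XML lets a document declare its own entities in an internal DTD subset, so the declarations could be prepended to the input before parsing. The class EntityDoctypeSketch, its methods, and the DOCTYPE name "doc" are hypothetical. The sketch assumes the input has no XML declaration and no DOCTYPE of its own, and that the parser is non-validating (the factory default), so the DOCTYPE name need not match the root element.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityDoctypeSketch {

    // Hypothetical helper: builds an internal DTD subset that declares each
    // entity name as a numeric character reference, e.g.
    // <!ENTITY nbsp "&#160;">
    static String doctypeFor(Map<String, Integer> entities) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE doc [");
        for (Map.Entry<String, Integer> e : entities.entrySet()) {
            sb.append("<!ENTITY ").append(e.getKey())
              .append(" \"&#").append(e.getValue()).append(";\">");
        }
        return sb.append("]>").toString();
    }

    // Assumption: xml has no XML declaration and no DOCTYPE of its own,
    // otherwise the prepended DOCTYPE makes the document ill-formed.
    static Document parseWithEntities(String xml, Map<String, Integer> entities) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String withDoctype = doctypeFor(entities) + xml;
            return builder.parse(new InputSource(new StringReader(withDoctype)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> entities = new LinkedHashMap<>();
        entities.put("nbsp", 160);   // U+00A0
        entities.put("laquo", 171);  // U+00AB
        entities.put("raquo", 187);  // U+00BB
        Document doc = parseWithEntities("<p>&laquo;a&nbsp;b&raquo;</p>", entities);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

This still needs the name-to-code-point list from somewhere, so it only moves the table out of the replace loop; it does not remove the need for the list itself.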