How to make XML Parser aware of all Character Entity References?

Question

I get arbitrary XML from a server and parse it using this Java code:

String xmlStr; // arbitrary XML input
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); 
try {
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource is = new InputSource(new StringReader(xmlStr));
    return builder.parse(is);
}
catch (SAXException | IOException | ParserConfigurationException e) {
    LOGGER.error("Failed to parse XML.", e);
}

Every once in a while, the XML input contains an entity reference the parser does not know, such as &nbsp;, and parsing fails with an error like org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.

I could solve this problem by preprocessing the original xmlStr and translating all problematic entity references before parsing. Here's a dummy implementation that works:

protected static String translateEntityReferences(String xml) {
    String newXml = xml;
    Map<String, String> entityRefs = new HashMap<>();
    entityRefs.put("&nbsp;", "&#160;");
    entityRefs.put("&laquo;", "&#171;");
    entityRefs.put("&raquo;", "&#187;");
    // ... and 250 more...
    for(Entry<String, String> er : entityRefs.entrySet()) {
        newXml = newXml.replace(er.getKey(), er.getValue());
    }
    return newXml;
}

However, this is really unsatisfactory, because there are a huge number of entity references, and I don't want to hard-code all of them into my Java class.
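
For reference, the preprocessing idea above can also be done in a single pass over the input rather than one replace call per entity. The following is only a sketch: the class name EntityTranslator and its translate method are made up for illustration, and the lookup table is passed in as a parameter so it could be filled from any source instead of being hard-coded. Note that a plain regex also rewrites references inside CDATA sections and comments, which may or may not be acceptable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class EntityTranslator {
    // Matches named references such as &nbsp; (numeric ones like &#160; do not match).
    private static final Pattern NAMED_REF = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces every named reference found in the table; all others are left untouched. */
    public static String translate(String xml, Map<String, String> table) {
        Matcher m = NAMED_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = table.get(m.group(1));
            // Names missing from the table (e.g. the predefined amp, lt, gt) pass through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("nbsp", "&#160;");
        table.put("laquo", "&#171;");
        table.put("raquo", "&#187;");
        System.out.println(translate("<doc>&laquo;a&nbsp;&amp;&nbsp;b&raquo;</doc>", table));
        // prints: <doc>&#171;a&#160;&amp;&#160;b&#187;</doc>
    }
}
```

This still needs the table from somewhere, so it only addresses the repeated string scanning, not the hard-coding concern.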

Is there any easy way of teaching this entire list of character entity references to the DocumentBuilder?

Here you go: https://dev.w3.org/html5/html-author/charref Have fun! — Jim Garrison, Aug 04 '16 at 15:32
Looks like fun, but how do I convince my DocumentBuilder of the same? ;-) — dokaspar, Aug 04 '16 at 15:36
You can try this regex to replace the matching content with a blank string: String regexex = "&|#|[A-Za-z]?(\\w+|\\d+);"; Pattern pattern = Pattern.compile(regexex); Or else you can try the jsoup library. Check this link: http://stackoverflow.com/questions/36026353/parsing-xml-file-containing-html-entities-in-java-without-changing-the-xml — Rishal, Aug 04 '16 at 16:30
It also looks like the same requirement. Check whether it helps you. — Rishal, Aug 04 '16 at 16:42
Maybe some regex preprocessing really is needed. However, my wish would be to get *any* reference translated into the correct character (not only &nbsp;, but the whole list...). I hoped Java had a solution for this already... — dokaspar, Aug 05 '16 at 05:01

score 1 · Answer 1 · answered Aug 04 '16 at 20:19

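If you can change your code to work with StAX instead of DOM, the trivial solution is to set the XMLInputFactory property IS_REPLACING_ENTITY_REFERENCES to false. The reader then reports each entity reference as an ENTITY_REFERENCE event instead of failing on an undeclared one.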

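import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public class StaxEntityDemo
{
    public static void main(String[] args) throws Exception
    {
        String doc = "<doc>&nbsp;</doc>";
        ByteArrayInputStream is = new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8));

        // Report entity references as events instead of trying to expand them.
        XMLInputFactory xif = XMLInputFactory.newFactory();
        xif.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLStreamReader xr = xif.createXMLStreamReader(is);

        // hasNext() is false once END_DOCUMENT is reached, so that event is never printed.
        while(xr.hasNext())
        {
            int t = xr.getEventType();
            switch(t) {
                case XMLEvent.ENTITY_REFERENCE:
                    // Prints the entity name and its replacement text (null: nbsp is not declared).
                    System.out.println("Entity: " + xr.getLocalName() + " " + xr.getText());
                    break;
                case XMLEvent.START_DOCUMENT:
                    System.out.println("Start Document");
                    break;
                case XMLEvent.START_ELEMENT:
                    System.out.println("Start Element: " + xr.getLocalName());
                    break;
                case XMLEvent.END_DOCUMENT:
                    System.out.println("End Document");
                    break;
                case XMLEvent.END_ELEMENT:
                    System.out.println("End Element: " + xr.getLocalName());
                    break;
                default:
                    System.out.println("Other:  ");
                    break;
            }
            xr.next();
        }
    }
}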

Output:

Start Document
Start Element: doc
Entity: nbsp null
End Element: doc

But that may require too much of a rewrite in your code if you really need the full DOM tree in memory.
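
If a DOM tree is still needed, one possibility is to build it by hand from the StAX events and resolve each ENTITY_REFERENCE event through a lookup table. This is a rough sketch under several assumptions: the class StaxDomBuilder and the entities map are hypothetical, namespaces, comments and processing instructions are ignored, and it relies on the reader reporting undeclared references as events the way the output above shows. Values in the map are the actual characters, not numeric references.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: builds a simplified DOM from StAX events.
public class StaxDomBuilder {
    public static Document parse(String xml, Map<String, String> entities) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        xif.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLStreamReader xr = xif.createXMLStreamReader(new StringReader(xml));

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Node current = doc; // node that receives the next child
        while (xr.hasNext()) {
            switch (xr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    Element e = doc.createElement(xr.getLocalName());
                    for (int i = 0; i < xr.getAttributeCount(); i++) {
                        e.setAttribute(xr.getAttributeLocalName(i), xr.getAttributeValue(i));
                    }
                    current.appendChild(e);
                    current = e;
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    current = current.getParentNode();
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    // Text is only legal below an element, never directly under the document.
                    if (current != doc) {
                        current.appendChild(doc.createTextNode(xr.getText()));
                    }
                    break;
                case XMLStreamConstants.ENTITY_REFERENCE:
                    String name = xr.getLocalName();
                    String value = entities.get(name);
                    // Unknown names are kept as literal text such as "&foo;".
                    current.appendChild(doc.createTextNode(
                            value != null ? value : "&" + name + ";"));
                    break;
                default:
                    break; // comments, processing instructions etc. are skipped in this sketch
            }
        }
        return doc;
    }
}
```

A call such as StaxDomBuilder.parse("<doc>a&nbsp;b</doc>", Collections.singletonMap("nbsp", "\u00A0")) would then be expected to yield a doc element whose text contains a no-break space between a and b.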

I spent an hour tracing through the DOM implementation and couldn't find any way to make the DOM parser read from an XMLStreamReader.

Also, there is evidence in the code that the internal DOM parser implementation has an option similar to IS_REPLACING_ENTITY_REFERENCES, but I couldn't find any way to set it from the outside.
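
An alternative that stays within the public DOM API, offered only as a sketch: declare the entities in an internal DTD subset and prepend it to the input before handing it to the DocumentBuilder. The class DoctypeInjector, the doctypeFor helper and the rootName parameter are invented for illustration. The sketch assumes the input carries no DOCTYPE of its own and that DTD processing has not been disabled on the factory.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: teaches the parser extra entities via an internal DTD subset.
public class DoctypeInjector {
    /** Builds e.g. <!DOCTYPE doc [<!ENTITY nbsp "&#160;">]> from the given table. */
    static String doctypeFor(String rootName, Map<String, String> entities) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(rootName).append(" [");
        for (Map.Entry<String, String> e : entities.entrySet()) {
            sb.append("<!ENTITY ").append(e.getKey())
              .append(" \"").append(e.getValue()).append("\">");
        }
        return sb.append("]>").toString();
    }

    public static Document parse(String xml, String rootName, Map<String, String> entities)
            throws Exception {
        // An XML declaration, if present, has to stay at the very start of the input.
        int insertAt = 0;
        if (xml.startsWith("<?xml")) {
            insertAt = xml.indexOf("?>") + 2;
        }
        String patched = xml.substring(0, insertAt)
                + doctypeFor(rootName, entities)
                + xml.substring(insertAt);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(patched)));
    }
}
```

Here the map values are numeric character references such as "&#160;", the same form used in the question's table. Because the default factory is non-validating, a rootName that does not match the actual root element should not cause a parse error, but that is worth verifying against the parser in use.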

The really sad part is that the code that scans the entity reference and throws the exception (`com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLStringBuffer)`) actually checks a `fReplaceEntityReferences` option flag. If I manually tweak this to `false` in the debugger the code builds the DOM just how you want it. But there appears to be no way to set it from the public API and no way to get access to the implementation either. — Jim Garrison, Aug 05 '16 at 05:41
