
I have a Java XML utility class. Its buildDocument() method accepts an XML string and returns an org.w3c.dom.Document. The XML I'm passing to it is an XHTML 1.1 document.

The issue is that HTML named entities are dropped. For example, given the input

<p>Preserve dagger &dagger;</p>

the output is,

<p>Preserve dagger </p>

It does preserve &lt;, &gt;, &amp;, &quot;.

Here is the class that creates the Document.

package com.example;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public final class XMLUtils {

    private XMLUtils() {
    }

    public static Document buildDocument(String xml) throws ParserConfigurationException, SAXException, IOException {

        DocumentBuilderFactory domFactory = DocumentBuilderFactory
            .newInstance();
        domFactory.setNamespaceAware(true);

        domFactory.setFeature("http://xml.org/sax/features/validation", false);
        domFactory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        domFactory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        domFactory.setCoalescing(false);
        DocumentBuilder builder = domFactory.newDocumentBuilder();

        Document doc = builder.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));

        try {
            DOMSource domSource = new DOMSource(doc);
            StringWriter writer = new StringWriter();
            StreamResult result = new StreamResult(writer);
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();
            transformer.transform(domSource, result);
            System.out.println("XML OUT: \n" + writer.toString());
        } catch (Exception ex) {
            // Debug output only: report the failure instead of silently swallowing it.
            ex.printStackTrace();
        }

        return doc;
    }
}

I think these are the relevant dependencies.

<dependency>
    <groupId>net.sf.saxon</groupId>
    <artifactId>Saxon-HE</artifactId>
    <version>9.5.1-6</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.11.0</version>
    <type>jar</type>
</dependency>
<dependency>
    <groupId>xml-resolver</groupId>
    <artifactId>xml-resolver</artifactId>
    <version>1.2</version>
    <type>jar</type>
</dependency>

Any ideas on how to preserve these entities? Thanks, /w

Ortomala Lokni
wsams
  • look at this http://stackoverflow.com/questions/4095451/java-xml-processing-entity-problem – Naren Feb 20 '15 at 07:31
  • @Naren I read through that question and I'm not sure it applies to this situation. I have DTD validation turned off. Was hoping to pass through all entities. I've been looking into an Entity Resolver - just not sure how to implement it yet, or if it will solve this problem. – wsams Feb 20 '15 at 14:06
  • When I set an entity resolver, the `resolveEntity(publicId, systemId)` method is never called. I'm trying to turn on DTD loading, but I keep having to fix other cascading exceptions. – wsams Feb 20 '15 at 14:50
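
The entity-resolver route mentioned in the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `LocalDtdResolverSketch` and the one-line `MINI_DTD` are hypothetical stand-ins (a real setup would serve local copies of the XHTML 1.1 DTD and its entity sets), and the sketch leaves the `load-external-dtd` feature at its default. As far as I can tell, a resolver is only consulted when the parser is actually allowed to load the external DTD, which would explain why `resolveEntity` is never called in the class from the question. Note that this expands `&dagger;` to the character U+2020 rather than keeping the literal reference.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class LocalDtdResolverSketch {

    // Hypothetical stand-in for a local copy of the XHTML entity declarations.
    static final String MINI_DTD = "<!ENTITY dagger \"&#8224;\">";

    // Sample input: an XHTML 1.1 document that references the external DTD.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" "
        + "\"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<p>Preserve dagger &dagger;</p></body></html>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        // Assumption: external DTD loading is left enabled (the parser default),
        // otherwise the resolver below is not expected to be consulted.
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Answer every external entity request with the local declarations
        // instead of fetching anything over the network.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader(MINI_DTD)));
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // The reference is now declared, so its replacement text is kept.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Because the entity is declared, the parser no longer drops it, but the DOM holds the expanded character, so a serializer will write that character (or a numeric reference) instead of `&dagger;`.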

1 Answer


It took me some time to find a solution to this problem; apparently it is hard to come up with the right search keywords. Since I found this question before finding the best answer, I thought it was worth linking that answer here, even though it is also on Stack Overflow: Keep numeric character entity characters such as `&#10; &#13;` when parsing XML in Java

It is not quite satisfactory, but at least it explains very well why there is no better solution.
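
For illustration, here is a minimal sketch of one common workaround for this kind of problem: escape the named references before parsing so the parser treats them as plain text, then undo the escaping on the serialized output. This is not necessarily what the linked answer proposes. The class and method names (`EntityEscapeSketch`, `protect`, `restore`, `roundTrip`) are hypothetical, and the regular expressions are a simplification that also touches text inside comments and CDATA sections.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityEscapeSketch {

    // Named references other than the five predefined XML entities.
    private static final Pattern NAMED_ENTITY =
        Pattern.compile("&(?!(?:lt|gt|amp|quot|apos);)([A-Za-z][A-Za-z0-9]*);");

    // The same references after protect() and serialization.
    private static final Pattern PROTECTED_ENTITY =
        Pattern.compile("&amp;(?!(?:lt|gt|amp|quot|apos);)([A-Za-z][A-Za-z0-9]*);");

    /** Rewrites &dagger; as &amp;dagger; so the parser sees ordinary text. */
    static String protect(String xml) {
        return NAMED_ENTITY.matcher(xml).replaceAll("&amp;$1;");
    }

    /** Reverses protect() on serialized output. */
    static String restore(String serialized) {
        return PROTECTED_ENTITY.matcher(serialized).replaceAll("&$1;");
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(protect(xml))));
    }

    /** Parses, serializes with an identity transform, and restores the references. */
    static String roundTrip(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(parse(xml)), new StreamResult(writer));
        return restore(writer.toString());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("<p>Preserve dagger &dagger;</p>"));
    }
}
```

The trade-off is that the DOM now contains the literal text `&dagger;` rather than the dagger character, and input that already contained `&amp;dagger;` as literal text cannot be distinguished from a protected reference once it is restored.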

Community