Work with raw text in javax.xml.transform.Transformer

Question

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:

String s = "This &mdash; That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));

System.out.println(stringWriter.toString()); // outputs "This &amp;mdash; That" at the relevant Node.

I have no control over the input string and I need exactly the output "This &mdash; That".

If I use StringEscapeUtils.unescapeHtml, the output is "This — That" which is not what I need.

I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "&mdash;".

What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text or how can I transform the input to get entities in the output?

Please explain how this is a duplicate.

The question referenced had the problem that a line-break character reference (such as "&#13;&#10;") was being converted into CRLF because the entities were being resolved. The solution was to escape the entities.

My problem is the reverse. The text is already escaped and the transformer is re-escaping the text. "&mdash;" is output as "&amp;mdash;".

I cannot use the solution of post-converting all "&amp;" -> "&" because not all nodes represent HTML.

More complete code:

TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
rootElement.appendChild(document.createTextNode("This &mdash; That"));
document.appendChild(rootElement);

DOMImplementation domImpl = document.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
                "-//Company//program//language",
                "test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());

// outputs xml header, then "<Test>This &amp;mdash; That</Test>"

Would you please re-review this for duplicate status. The duplicate question is generating output from String s. Generating output resolves entities, so s must be escaped. My question is generating input from String s. Generating input escapes entities, so s will acquire additional escape markup. My problem is not keeping entity characters, like the duplicate question. I'm keeping them well enough. Too well. I'm getting extras I don't want. — tzimnoch, Dec 02 '15 at 20:36
documentBuilder.parse creates a Document out of XML. I'm trying to create XML out of a Document. Push comes to shove I'm just going to use StringBuilder. — tzimnoch, Dec 02 '15 at 21:21
No; you're trying to create document from XML, transform it, then get XML back. As long as the XML you parse has the entities you want, that should work. Your actual problem is that `document.createTextNode` is precisely the opposite of what you're trying to do. — SLaks, Dec 02 '15 at 22:18
My workflow is not XML->Document->XML. It is database -> DOM Document -> XML (user editing) -> HTML(sometimes)/back to database(always). It is the DOM Document->XML portion where the problem resides. document.createTextNode(String) is showing how I'm creating this Node of the DOM. Transformer.transform(DOMSource, StreamResult) is how I'm converting the DOM into XML. If document.createTextNode is the problem, what function is the solution? I tried document.createCDATASection(String) with the same effect. — tzimnoch, Dec 02 '15 at 22:42
I could go directly from database -> XML by using StringBuilder, but I think using DOM Document as an intermediate step makes the code easier to maintain and more resilient. — tzimnoch, Dec 02 '15 at 22:44

score 1 · Accepted Answer · edited May 23 '17 at 10:27


The fact is, once you have a DOM tree, there's no longer a string containing "&mdash;": a parsed entity has already been resolved to the character it stands for, and the text is held internally as a plain Unicode string.
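A minimal sketch of both halves of that point, assuming the JDK's default JAXP parser and serializer. The class and method names are made up for illustration, and the inline DOCTYPE that declares mdash is an assumption standing in for whatever DTD the real documents use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EntityRoundTrip {
    /** Parses XML and returns the root element's text as the DOM holds it. */
    static String parsedText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Puts the given string into a text node verbatim, then serializes the document. */
    static String serializedTextNode(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("Test");
            // createTextNode treats "&" as a literal character, not as markup.
            root.appendChild(doc.createTextNode(text));
            doc.appendChild(root);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed inline DTD so the parser knows what &mdash; means.
        String xml = "<!DOCTYPE Test [<!ENTITY mdash \"&#8212;\">]>"
                + "<Test>This &mdash; That</Test>";
        // After parsing, the DOM holds U+2014, not the six characters "&mdash;".
        System.out.println(parsedText(xml).equals("This \u2014 That"));
        // A literal "&" in a text node has to be escaped on output, hence "&amp;mdash;".
        System.out.println(serializedTextNode("This &mdash; That"));
    }
}
```

The second call reproduces the behaviour from the question: the serializer is not re-escaping an entity, it is escaping a literal ampersand that createTextNode stored as ordinary character data.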

So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.

Regarding serialization, there are a few other questions, including "Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat".

To parse a single node, there is LSParser.parseWithContext.
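The parse-then-serialize route can be sketched roughly as follows. This is not the LSParser.parseWithContext call named above: it swaps in a plain DocumentBuilder plus Document.importNode, on the assumption that the raw string is well-formed once wrapped in an element. The wrapper element, the inline mdash declaration, and all class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class RawFragmentImport {
    // Assumption: the raw strings only use &mdash;. Other entities would need
    // their own declarations here (or a real DTD).
    private static final String PROLOG =
            "<!DOCTYPE wrapper [<!ENTITY mdash \"&#8212;\">]>";

    /** Parses raw inside a throwaway wrapper and copies the resulting nodes under parent. */
    static void appendRaw(Element parent, String raw) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document parsed = builder.parse(
                new InputSource(new StringReader(PROLOG + "<wrapper>" + raw + "</wrapper>")));
        Document target = parent.getOwnerDocument();
        for (Node n = parsed.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            // importNode copies the node into the target document; the source stays intact.
            parent.appendChild(target.importNode(n, true));
        }
    }

    /** Builds <Test> from the raw string and serializes it as US-ASCII. */
    static String buildAndSerialize(String raw) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("Test");
            doc.appendChild(root);
            appendRaw(root, raw);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Characters that US-ASCII cannot represent are written as character references.
            t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndSerialize("This &mdash; That"));
    }
}
```

With the JDK's built-in serializer I would expect the em dash to come out as the numeric reference &#8212; rather than the named &mdash;. Any XML parser reads the two identically, but if the named form is a hard requirement, this sketch alone does not get you there.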

answered Dec 08 '15 at 05:06

ivan_pozdeev


Thank you for taking the time to understand my issue and provide a few options. – tzimnoch Dec 09 '15 at 17:52
