
Consider:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();

Element root = doc.createElement("list");
doc.appendChild(root);

for(CorrectionEntry correction : dictionary){
    Element elem = doc.createElement("elem");
    elem.setAttribute("from", correction.getEscapedFrom());
    elem.setAttribute("to", correction.getEscapedTo());
    root.appendChild(elem);
}

(then follows the writing of the document into an XML file)

where getEscapedFrom and getEscapedTo return (in my code) something like fink&#xE9; if the original word is finké, so as to apply a Unicode escape to any character with a code point above 127.

The problem is that the final XML has the following line <elem from="finke" to="fink&amp;#xE9;" /> (from is finke, to is finké) where I would like it to be <elem from="finke" to="fink&#xE9;" />

I've tried, following another response in StackOverflow, to disable escaping of ampersands putting the line doc.appendChild(doc.createProcessingInstruction(StreamResult.PI_DISABLE_OUTPUT_ESCAPING, "&")); after the creation of the doc but without success.

How could I "tell XML" not to escape ampersands? Or, conversely, how could I get "XML" to convert é, or \u00E9, to &#xE9;?

Update

I managed to narrow down the problem: up until the file is written, the node (inspected in the debugger) seems to contain the right string. Once I call transformer.transform(domSource, streamResult); everything goes wild.

DOMSource domSource = new DOMSource(doc);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
StreamResult streamResult = new StreamResult(baos);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(domSource, streamResult);
System.out.println(baos.toString());

The problem seems to be the transformer.

Dave Jarvis
MauroT
  • Such escapes are not particularly useful, and Java's Document model doesn't provide anything to produce them. You could generate the XML first and replace all non-ASCII characters afterwards, the usual way. – kumesana Oct 25 '19 at 08:45
  • It would be useful if there is no other way to accomplish this... Having to open the generated file to replace every `&amp;` with `&` doesn't seem to be the correct way to do things: what if there is an `&amp;` that must not be turned into `&`? – MauroT Oct 25 '19 at 09:16

1 Answer


Try setting setOutputProperty("encoding", "us-ascii") on the transformer. That tells the serializer to produce the output using ASCII characters only, which means any non-ASCII character will be escaped. But you can't control whether it will be a decimal or hex escape (unless you use Saxon-PE or higher as your Transformer, in which case there's a serialization option to control this).
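As a minimal sketch of that suggestion (not the asker's actual code): the class name AsciiXmlDemo and the helper methods buildSample and serializeAscii are made up for illustration, and the example assumes the JDK's built-in Transformer implementation. The attribute is set to the raw word finké, with no manual escaping, and the serializer is left to emit the character reference because é cannot be represented in US-ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; names are illustrative only.
public class AsciiXmlDemo {

    // Builds <list><elem from="finke" to="finké"/></list> with the raw,
    // unescaped character in the attribute value.
    static Document buildSample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("list");
        doc.appendChild(root);

        Element elem = doc.createElement("elem");
        elem.setAttribute("from", "finke");
        elem.setAttribute("to", "fink\u00E9"); // plain é, no "&#xE9;" string
        root.appendChild(elem);
        return doc;
    }

    // Serializes the document with the output encoding set to US-ASCII, so
    // the serializer must write non-ASCII characters as character references.
    static String serializeAscii(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.US_ASCII.name());

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(baos));
        return new String(baos.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeAscii(buildSample()));
    }
}
```

The key design choice is to keep the DOM holding the real character and let the encoding setting drive the escaping. Any "&" placed in an attribute value by hand is literal text to the serializer, which is why it comes out as "&amp;".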

It's never a good idea to try to do the serialization "by hand". For at least three reasons: (a) you'll get it wrong (we see a lot of SO questions caused by people producing bad XML this way), (b) you should be working with the tools, not against them, (c) the people who wrote the serializers understand XML better than you do, and they know what's expected of them. You're probably working to requirements written by someone whose understanding of XML is very superficial.

Michael Kay
  • Close enough! Once I set `transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.US_ASCII.name());` I obtained `fink&#233;`. It's not hex, but I think the program that reads this XML won't complain. Thank you! – MauroT Oct 26 '19 at 16:38