
Please note this is not the same question as mentioned above since XML escaping to preserve codepoints is possible.

I have a UTF-8 XML file which I can send via HTTP to some other system which I have no control over. For whatever crazy reason, it decides to convert it to ISO-8859-1, losing many Unicode characters and replacing them with '?'. This system then sends the converted XML document to someone else.

How, in Java on the sending side, can I escape arbitrary XML containing non-ASCII code points so that they survive this intermediary system and can still be decoded correctly by the endpoint?

A --(UTF-8)--> B --(ISO-8859-1)--> C (Decodes to internal Unicode representation).

import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
import org.apache.commons.lang3.text.translate.NumericEntityEscaper;

public class Test {
    private static CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_XML
            .with(NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE));

    public static void main(String[] args) {
        String s = "<note>\n<to>Tove</to>\n<from>Jani</from>\n<heading>Reminder</heading>\n<body>Don't forget me this weekend!test☠ä</body>\n</note>";
        String xmlEscapedS = xmlToRobustXml(s);
        System.out.println(xmlEscapedS);
    }

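    /**
     * Normalizes the input to NFC, then applies the XML escaper.
     *
     * @param s the XML text to escape
     * @return the escaped text
     */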
    public static String xmlToRobustXml(String s) {
        s = Normalizer.normalize(s, Form.NFC);
        String xmlEscapedS = translator.translate(s);
        return xmlEscapedS;
    }
}

I tried this, but it escapes everything, including the markup:

&lt;note&gt;
&lt;to&gt;Tove&lt;/to&gt;
&lt;from&gt;Jani&lt;/from&gt;
&lt;heading&gt;Reminder&lt;/heading&gt;
&lt;body&gt;Don&apos;t forget me this weekend!test&#9760;&#228;&lt;/body&gt;
&lt;/note&gt;
Eric des Courtis
  • possible duplicate of [How do I convert between ISO-8859-1 and UTF-8 in Java?](http://stackoverflow.com/questions/652161/how-do-i-convert-between-iso-8859-1-and-utf-8-in-java) – Paul Vargas May 22 '13 at 21:15
  • @PaulVargas Slightly different in the case of XML since `&#xx;` is possible. I am not sure how to do this with any existing XML library however. – Eric des Courtis May 22 '13 at 21:19
  • Can't you just send it an ISO-8859-1 encoded document? All the code points outside that range can then be escaped using [character references](http://www.w3.org/TR/REC-xml/#sec-references). – McDowell May 22 '13 at 21:22
  • @McDowell Can you tell me which library does this? I have no experience in this. – Eric des Courtis May 22 '13 at 21:24
  • How do you produce your XML file? Do you use `javax.xml.transform.Transformer` or something else? – parsifal May 22 '13 at 21:47
  • @parsifal simple-xml but I don't mind switching to something else or passing it through some other xml parser to solve my problem. – Eric des Courtis May 22 '13 at 21:48

2 Answers


Here are three standard API methods to produce ISO-8859-1 encoded documents.

Using the StAX API:

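import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;

// output stream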
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
// transcode
StringReader xml = new StringReader("<x>pi: \u03A0</x>");
XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(
    xml);
XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(
    buffer, "ISO-8859-1");
try {
  writer.add(reader);
} finally {
  writer.close();
}
// proof
String decoded = new String(buffer.toByteArray(),
    Charset.forName("ISO-8859-1"));
System.out.println(decoded);

Using the DOM API:

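import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.xml.sax.InputSource;

// output stream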
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
// create XML DOM
InputSource src = new InputSource(new StringReader("<x>pi: \u03A0</x>"));
Document doc = DocumentBuilderFactory.newInstance()
    .newDocumentBuilder()
    .parse(src);
// serialize
DOMImplementationLS impl = (DOMImplementationLS) doc.getImplementation();
LSOutput out = impl.createLSOutput();
out.setEncoding("ISO-8859-1");
out.setByteStream(buffer);
impl.createLSSerializer().write(doc, out);
// proof
String decoded = new String(buffer.toByteArray(),
    Charset.forName("ISO-8859-1"));
System.out.println(decoded);

Using the transform package:

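import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// output stream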
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
// transformation
StreamSource src = new StreamSource(new StringReader("<x>pi: \u03A0</x>"));
StreamResult res = new StreamResult(buffer);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(src, res);
// proof
String decoded = new String(buffer.toByteArray(),
    Charset.forName("ISO-8859-1"));
System.out.println(decoded);

Which you would use depends on your use case; the StAX API is probably the most efficient.

All this sample code will emit documents equivalent to:

<?xml version="1.0"?><x>pi: &#x3a0;</x>
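To check that the endpoint (C in the question's diagram) can recover the original code points, the serialized bytes can be parsed back. The following is a minimal sketch using the transform package from above; the class name `RoundTripCheck` and its two helper methods are hypothetical names chosen for illustration, and the exact XML declaration emitted may vary by JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
public class RoundTripCheck {

    /** Sender side (A): serializes the XML text as ISO-8859-1 bytes. */
    public static byte[] toLatin1Xml(String xml) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(buffer));
            return buffer.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Receiver side (C): parses the bytes and returns the root element's text. */
    public static String rootText(byte[] bytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "pi: \u03A0";
        byte[] bytes = toLatin1Xml("<x>" + original + "</x>");
        // The parser expands the character reference back into U+03A0.
        System.out.println(rootText(bytes).equals(original));
    }
}
```

Because U+03A0 is written as a character reference, every byte in the payload is plain ASCII, so a later conversion between UTF-8 and ISO-8859-1 should leave it unchanged.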
McDowell

The Unicode code points above 127 can be encoded as numeric entities like &#123; using the following:

This uses StringEscapeUtils from Apache Commons Lang. Read the Javadoc: by default, escapeXml does not convert to numeric entities.

StringEscapeUtils.ESCAPE_XML
    .with(NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE));
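Applied to a whole document, `ESCAPE_XML` also escapes the markup characters themselves, which is the output shown in the question. One possible alternative is to escape only the code points above 127 and leave everything else alone. Below is a minimal plain-JDK sketch of that idea; `NonAsciiEscaper` and `escapeNonAscii` are hypothetical names, not part of Commons Lang.

```java
// Hypothetical helper class, written against the plain JDK only.
public class NonAsciiEscaper {

    /**
     * Replaces every code point above 127 with a decimal character
     * reference and copies all other characters, including markup, as-is.
     */
    public static String escapeNonAscii(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        for (int i = 0; i < xml.length(); ) {
            // Read whole code points so surrogate pairs become one reference.
            int cp = xml.codePointAt(i);
            if (cp > 0x7f) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <body>test&#9760;&#228;</body>
        System.out.println(escapeNonAscii("<body>test\u2620\u00e4</body>"));
    }
}
```

The result is pure ASCII, which is byte-for-byte identical in UTF-8 and ISO-8859-1. Note that this is a purely textual replacement: character references are not recognized in element names, comments, CDATA sections, or processing instructions, so a document with non-ASCII characters in those places needs a real XML serializer instead.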

BTW, you could also try sending the original XML with the header Content-Type: application/x-xml, i.e. as a binary transfer.

Joop Eggen
  • Node B receives correctly as UTF-8 but then converts to ISO-8859-1 when writing to disk. Unfortunately I have no control over this part. – Eric des Courtis May 22 '13 at 21:27