I have been trying to convert a UTF-8 String to its ISO-8859-1 equivalent so that I can output it in an XML document, but no matter what I try, the output is always displayed incorrectly.

To simplify the question, I created a code snippet with all the tests I did; the generated document is pasted after it.

Rest assured that I tried every possible combination of new String(xxx.getBytes("UTF-8"), "ISO-8859-1"), swapping UTF-8 and ISO-8859-1 and sometimes using the same charset on both sides. Nothing works!
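A minimal sketch of what that round trip does, assuming only the JDK (the class and method names here are hypothetical and not part of the original snippet). A Java String holds characters rather than bytes in a particular charset, so re-decoding its UTF-8 bytes as ISO-8859-1 yields different characters, not the same text in another encoding. The charset only comes into play at the point where characters are turned into bytes.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: the round trip described above. The UTF-8 bytes of
    // the input are re-read as if they were ISO-8859-1, one char per byte.
    static String utf8BytesReadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: the actual encoding step, characters to bytes.
    static byte[] toLatin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e9 is 'é'; the escape keeps the demo independent of the
        // encoding of the source file itself.
        String original = "h\u00e9llo";

        // The round trip turns one 'é' into two characters (U+00C3, U+00A9).
        String roundTrip = utf8BytesReadAsLatin1(original);
        System.out.println(original.length() + " chars before, "
                + roundTrip.length() + " chars after the round trip");

        // Encoding to ISO-8859-1 gives one byte per character.
        System.out.println(toLatin1Bytes(original).length + " bytes in ISO-8859-1");
    }
}
```

If this reading is right, none of the String-to-String conversions can help: the string is already correct, and only the final characters-to-bytes step decides which encoding reaches the client.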

Here's the snippet:

// @see http://stackoverflow.com/questions/229015/encoding-conversion-in-java
private static String changeEncoding(String input) throws Exception {
    // Create the encoder and decoder for ISO-8859-1
    Charset charset = Charset.forName("ISO-8859-1");
    CharsetDecoder decoder = charset.newDecoder();
    CharsetEncoder encoder = charset.newEncoder();

    // Convert a string to ISO-LATIN-1 bytes in a ByteBuffer
    // The new ByteBuffer is ready to be read.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(input));

    // Convert ISO-LATIN-1 bytes in a ByteBuffer to a CharBuffer and then to a string.
    // The new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    return cbuf.toString();
}

// @see http://stackoverflow.com/questions/655891/converting-utf-8-to-iso-8859-1-in-java-how-to-keep-it-as-single-byte
private static String byteEncoding(String input) throws Exception {
    Charset utf8charset = Charset.forName("UTF-8");
    Charset iso88591charset = Charset.forName("ISO-8859-1");

    ByteBuffer inputBuffer = ByteBuffer.wrap(input.getBytes());

    // decode UTF-8
    CharBuffer data = utf8charset.decode(inputBuffer);

    // encode ISO-8859-1
    ByteBuffer outputBuffer = iso88591charset.encode(data);
    byte[] outputData = outputBuffer.array();
    return new String(outputData, "ISO-8859-1");
}

public static Result home() throws Exception {
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

    //root elements
    Document doc = docBuilder.newDocument();
    doc.setXmlVersion("1.0");
    doc.setXmlStandalone(true);

    Element rootElement = doc.createElement("test");
    doc.appendChild(rootElement);

    rootElement.setAttribute("original", "héllo");

    rootElement.setAttribute("stringToString", new String("héllo".getBytes("UTF-8"), "ISO-8859-1"));

    rootElement.setAttribute("stringToBytes", changeEncoding("héllo"));

    rootElement.setAttribute("stringToBytes2", byteEncoding("héllo"));

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");

    StringWriter writer = new StringWriter();
    transformer.transform(new DOMSource(doc), new StreamResult(writer));
    String output = writer.getBuffer().toString().replaceAll("\n|\r", "");

    // The following is Play!Framework-specific code for rendering a URL, but I believe this is not the problem (I checked in the developer console: the document is correctly served as "ISO-8859-1")
    response().setHeader("Content-Type", "text/xml; charset=ISO-8859-1");
    return ok(output).as("text/xml");
}

And the result:

<?xml version="1.0" encoding="ISO-8859-1"?>
<test original="héllo" stringToBytes="héllo" stringToBytes2="héllo" stringToString="héllo"/>

How can I proceed?

halfer
Cyril N.
  • I think you misspelled `response`. If you're talking about `response()` from Play!Framework, there is no `setCharacterEncoding()` (I'm using Play! 2.1.5). There is also no `setCharacterEncoding()` in `doc` (Document). – Cyril N. Feb 23 '14 at 11:00
  • Thanks for your help. I already set the encoding to "ISO-8859-1" by calling `setHeader`. There is no `encoding` in Play v2.1.5 (but there is a CONTENT_ENCODING, which is final) – Cyril N. Feb 23 '14 at 11:20
  • Sorry again. I read 1.2.5 and not 2.1.5. – JB Nizet Feb 23 '14 at 11:23
  • No problem ;) I also tried to use only the encoding in the HTTP response without modifying the strings, but it didn't work. It only works if I also remove the encoding from the XML document, but that's because the page is then displayed as UTF-8, which I don't want. – Cyril N. Feb 23 '14 at 11:24
  • You were kind of right. I finally switched the output from a StringWriter to writing into a file, and then outputting this file directly as binary, and now everything works fine, with the right encoding. No switching of encoding was done! You can add your comments as an answer, I'll accept it :) – Cyril N. Feb 23 '14 at 15:24
  • You did all the job, and I wouldn't know how to turn it into a valid answer. Answer it yourself, and show what you did. I find it astonishing to be forced to go through a file to do that, but with Play, nothing really surprises me anymore. – JB Nizet Feb 23 '14 at 15:26

1 Answer

For a reason I can't explain, writing the document to a file and returning that file as the response fixed the encoding problem.

I decided to keep this question in case other people had a similar problem.

Here's the snippet:

TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");

File file = new File("Path/to/file.xml");
transformer.transform(new DOMSource(doc), new StreamResult(file));

response().setHeader("Content-Disposition", "attachment;filename=" + file.getName());
response().setHeader("Content-Type", "text/xml; charset=ISO-8859-1");
return ok(file).as("text/xml");
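A possible explanation, offered as a sketch rather than a confirmed diagnosis: a StringWriter collects characters, so the transformer only writes encoding="ISO-8859-1" into the declaration, and the real bytes are produced later by whatever sends the String. With a file (or any OutputStream) as the target, the transformer does the characters-to-bytes conversion itself, in the requested encoding. If that is the cause, a ByteArrayOutputStream should work without a temporary file. The class and method names below are hypothetical, and how the resulting byte[] is handed to Play is deliberately left out, since that depends on the Play version.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class XmlBytesSketch {
    // Hypothetical helper: serializes the document straight to bytes, so the
    // transformer applies the requested encoding instead of a later step.
    static byte[] toBytes(Document doc, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    // Builds a small document similar to the one in the question.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("test");
            root.setAttribute("original", "h\u00e9llo"); // \u00e9 is 'é'
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DocumentBuilder", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = toBytes(sampleDocument(), "ISO-8859-1");
        // Decode with the same charset only to print the result for inspection.
        System.out.println(new String(xml, StandardCharsets.ISO_8859_1));
    }
}
```

The point of the design is that the byte array, not a String, is what should reach the HTTP response, together with the same Content-Type header as in the snippet above.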
Cyril N.