After a lot of searching I decided to ask here. Below is sample code that reproduces my problem. The document is built with a Chinese character in an attribute value.

    String value= "";
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.newDocument();
    Element root = doc.createElement("value");
    root.setAttribute("attribute", value);
    doc.appendChild(root);
    DOMSource source = new DOMSource(doc);

I am trying to convert the document source to a String using the Transformer class, with the code below.

    ByteArrayOutputStream outStream = null;
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    StreamResult htmlStreamResult = new StreamResult( new ByteArrayOutputStream() );
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.transform(source, htmlStreamResult);
    outStream = (ByteArrayOutputStream) htmlStreamResult.getOutputStream();
    String outPut = outStream.toString( "UTF-8" );

But in the output the Chinese character has been converted to a numeric reference, as shown here.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?><value attribute="&#159776;"/>

I do not want the Chinese character to be converted; I want it written out as it is. I would appreciate any help with this.

Mark Jeronimus
Balan
  • Possible duplicate of [transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8") is NOT working](https://stackoverflow.com/questions/15592025/transformer-setoutputpropertyoutputkeys-encoding-utf-8-is-not-working) \[edit] confirmed working in your case too. – Mark Jeronimus Nov 27 '19 at 13:20
  • It sounds like it's going to output XML anyhow if you're using a transformer in that way. I haven't worked with whatever library it is you're using, but you want to save the plain text _and_ have it UTF-8 encoded. There are no ASCII values for Chinese logograms. – Rogue Nov 27 '19 at 13:20
  • I have added the suggested solution from the duplicate question, but it is not working in this case. Could you please recheck, @MarkJeronimus? – Balan Nov 27 '19 at 13:37
  • Yes, there are no ASCII values; I just figured out that it is an HTML entity for that Chinese character. @Rogue – Balan Nov 27 '19 at 13:38
  • It's a numeric XML character reference. In properly parsed XML those are **exactly identical** to actually putting the character there. It's fine if you prefer the output not to use character references, but note that any correct parser would interpret the two as completely identical (i.e. only try to "fix" this if it annoys you; if a recipient of that XML file actually treats them differently, then that recipient is at fault). – Joachim Sauer Nov 27 '19 at 13:53
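
The point in the last comment can be checked directly. The following is a minimal sketch, not part of the original thread: it parses the same element once with the numeric reference `&#159776;` and once with the literal character, and compares the resulting attribute values. The class name `CharRefDemo` and the helper `parseAttribute` are made up for illustration; the code point U+27020 is simply 159776 written in hexadecimal.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not from the original question or answer.
class CharRefDemo {

    /** Parses a one-element XML string and returns its "attribute" attribute value. */
    static String parseAttribute(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Parse from characters, so no byte encoding is involved at all.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("attribute");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // 159776 decimal is 0x27020, a code point outside the Basic Multilingual Plane.
        String literal = new String(Character.toChars(0x27020));

        String fromReference = parseAttribute("<value attribute=\"&#159776;\"/>");
        String fromLiteral = parseAttribute("<value attribute=\"" + literal + "\"/>");

        System.out.println(fromReference.equals(fromLiteral)); // true
        System.out.println(fromReference.equals(literal));     // true
    }
}
```

Both forms produce the same one-code-point string after parsing, so the difference only matters to someone reading the raw XML text.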

1 Answer

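Change the output encoding from UTF-8 to UTF-16. Since you end up with a String (which has no byte encoding of its own), this has no ill effect on the final text, as long as you decode the bytes with the same charset you set on the transformer. It does, however, change the encoding named in the XML declaration and can add a BOM (byte-order mark) at the start of the output. You can optionally omit the declaration and attach your own.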

    String value= "かな〜"; // (I don't see your character so I added some of my own)
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.newDocument();
    Element root = doc.createElement("value");
    root.setAttribute("attribute", value);
    doc.appendChild(root);
    DOMSource source = new DOMSource(doc);

    ByteArrayOutputStream outStream = null;
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    StreamResult htmlStreamResult = new StreamResult( new ByteArrayOutputStream() );
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
    // transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes"); // optional
    transformer.transform(source, htmlStreamResult);
    outStream = (ByteArrayOutputStream) htmlStreamResult.getOutputStream();
    String outPut = outStream.toString( "UTF-16" );
    System.out.println(outPut);

Output:

    <?xml version="1.0" encoding="UTF-16" standalone="no"?><value attribute="かな〜"/>
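
For the "leave the header out and attach your own" variant, here is a rough, self-contained sketch. It is one possible arrangement rather than the code from this answer: it writes to a `StringWriter` instead of a `ByteArrayOutputStream` (the approach mentioned for the duplicate thread), sets `OMIT_XML_DECLARATION`, and prepends a header string. The class name `SerializeDemo`, the method `serialize`, and the `HEADER` text are assumptions made for illustration; pick whatever declaration matches the encoding you will actually use when writing the String out as bytes.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not from the original answer.
class SerializeDemo {

    // Assumed header; it should name the encoding used when the String is later written as bytes.
    static final String HEADER = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

    /** Builds <value attribute="..."/> and serializes it with a hand-attached header. */
    static String serialize(String value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("value");
            root.setAttribute("attribute", value);
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // UTF-16 as in the answer above, so characters are written literally.
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
            // Suppress the generated declaration; our own header is prepended below.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // A Writer receives characters, so no bytes need to be decoded afterwards.
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return HEADER + writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // Kana plus wave dash, written as escapes so the source file encoding does not matter.
        System.out.println(serialize("\u304B\u306A\u301C"));
    }
}
```

Because the result goes to a `Writer` rather than a byte stream, there is no byte-level decoding step, and a BOM should not come into play. The exact output can still vary between `TransformerFactory` implementations, so treat this as a starting point.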
Mark Jeronimus
  • You're a genius. It works like a charm. You have saved my day :). @Mark Jeronimus – Balan Nov 28 '19 at 07:22
  • I just frankensteined your code with @Vyrx's answer in the duplicate thread and replaced the way that the output was extracted to your `ByteArrayOutputStream` method (instead of the `StringWriter`) – Mark Jeronimus Nov 28 '19 at 15:49