Reading XML file encoded in UTF16 in Java

Question

I am trying to read a UTF-16 xml file with Java. The file was written with C#.

Here's the java code:

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XMLReadTest
{
    public static void main(String[] s)
    {
        try
        {
            File fXmlFile = new File("C:\\my_file.xml");

            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(fXmlFile);

            doc.getDocumentElement().normalize();

            NodeList nList = doc.getElementsByTagName("row");

            for (int temp = 0; temp < nList.getLength(); temp++)
            {
                Node nNode = nList.item(temp);

                if (nNode.getNodeType() == Node.ELEMENT_NODE)
                {
                    Element eElement = (Element) nNode;

                    System.out.println("FILE_NAME: " + eElement.getElementsByTagName("FILE_NAME").item(0).getTextContent());
                }
            }
        }
        catch(Exception ex)
        {
            ex.printStackTrace();
        }
    }
}

And here's the xml file:

<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<docMetadata>
  <row>
    <FILE_NAME>Выписка_Винтовые насосы.pdf</FILE_NAME>
    <FILE_CAT>GENERAL</FILE_CAT>
  </row>
</docMetadata>

When I run this code in Eclipse with the encoding in the 'Common' tab of the Run/Debug settings window set to Default - Inherited (Cp1253), the output I get is wrong:

FILE_NAME: ???????_???????? ??????.pdf

When the selected encoding in the same tab is UTF-8, the output is OK:

FILE_NAME: Выписка_Винтовые насосы.pdf

What am I doing wrong?

How can I get the correct output with the default encoding (cp 1253) in eclipse project settings?

This code runs in a server where I don't want to change the default encoding of the virtual machine.

I have tested this code with both Java 7 and Java 8.

Are you sure that the contents of the string differs, rather than it just being the console output being a problem? If your default encoding can't represent the characters you're trying to output, it's not going to work... — Jon Skeet, Jun 18 '15 at 14:40
It's not only the console output. String value in the debugger shows as "?????????.pdf" also. Remy Lebeau and user3141592 below explain why it does not work in my case with Eclipse. But if we put eclipse aside, how can I fix this problem in a program that is executed with an .exe file in a Windows Server? Do I need to change something in my code, or is it something in Windows settings? — crapatzi, Jun 19 '15 at 07:20
I'm very surprised that it's not working, to be honest. Can you put a sample file on the web, just so we can validate the encoding is correct? — Jon Skeet, Jun 19 '15 at 07:33
Here's the file: https://drive.google.com/file/d/0B86OsnzLlycjQm5RTHBUbXMyUnM/view?usp=sharing Opening the file with Notepad++ shows 'UCS-2 Big Endian'. In the real program execution scenario, the xml is generated as a String from a C# web service like this: XElement doc = new XElement("docMetadata"); // add data to my xml, then: XDocument xmlInUtf16 = new XDocument(new XDeclaration("1.0", "UTF-16", "yes"), doc); StringWriter sr = new StringWriter(); xmlInUtf16.Save(sr); String xmlStr = sr.ToString(); return xmlStr; Continued below... — crapatzi, Jun 19 '15 at 08:21
Then Java takes this String and puts it in the DocumentBuilder for parsing. In the Eclipse debugger, the file name looks OK in Russian when I view the xml String. But when I extract its value from the document builder it comes out as ???????.pdf (when using Cp1253) in Eclipse. It comes out OK when using UTF-8 in Eclipse. It comes out wrong when my program is running in production. — crapatzi, Jun 19 '15 at 08:22
Okay - I'll try that file myself with your code... (As I said, the console output is a red herring here, IMO. You really want to know whether the values in the string are okay, regardless of whether they can be displayed on your console. I would print the integer value of `text.charAt(0)` for example.) — Jon Skeet, Jun 19 '15 at 08:24
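
A minimal sketch of the check suggested in the last comment: dump the code points of the string instead of printing the characters, so the console charset cannot hide what the string really contains. The class and method names (`CodePointDump`, `describe`) are made up for illustration, and the sample text is written with Unicode escapes so the source file encoding does not matter.

```java
class CodePointDump {
    // Returns the code points of s as space-separated U+XXXX values.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First two letters of the Russian file name from the question.
        String parsedOk = "\u0412\u044b";
        // What the string would look like if the data had really been lost.
        String parsedBad = "??";

        System.out.println(describe(parsedOk));  // U+0412 U+044B
        System.out.println(describe(parsedBad)); // U+003F U+003F
    }
}
```

If the dump shows values in the Cyrillic range (U+0400 to U+04FF), the parse was fine and only the display is at fault; if it shows U+003F, the question marks are actually stored in the string.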

Remy Lebeau · Answer 1 · 2015-06-18T23:11:50.327

1

The problem has nothing to do with the XML itself. Java strings are UTF-16 encoded, and the Document is correctly decoding the XML data to UTF-16 strings. The real problem is that you have Eclipse set to use cp1253 (Windows-1253 Greek, which is slightly different from ISO-8859-7 Greek) for its console charset, but most of the Unicode characters you are trying to output (Russian) simply do not exist in that charset, so they get replaced with ? instead. That also explains why the output is correct when the console charset is set to UTF-8 instead, as UTF-8 <-> UTF-16 conversions are lossless.
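
To illustrate the point about the charset, here is a small sketch that pushes the Russian text through windows-1253 and through UTF-8 and back. It assumes the JVM ships the `windows-1253` charset (standard full JDKs do); `ConsoleCharsetDemo` and `roundTrip` are names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConsoleCharsetDemo {
    // Encodes text with the given charset and decodes it again, which is
    // roughly what happens to text on its way through a console in that charset.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Cyrillic word from the file name, plus the ASCII extension.
        String name = "\u0412\u044b\u043f\u0438\u0441\u043a\u0430.pdf";
        Charset greek = Charset.forName("windows-1253");

        // The Greek code page has no Cyrillic letters at all.
        System.out.println(greek.newEncoder().canEncode(name)); // false

        // Unmappable characters are replaced with '?', ASCII survives.
        System.out.println(roundTrip(name, greek)); // ???????.pdf

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name)); // true
    }
}
```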

edited Jun 18 '15 at 23:11

answered Jun 18 '15 at 23:02

Remy Lebeau

You are right (same explanation as user3141592 below). This problem came up in a java program which is developed with Eclipse and in production environment it is executed with an .exe file in a Windows Server. In the production environment the file name comes out ???????????.pdf. If we put Eclipse aside, how can I fix this problem in production environment? – crapatzi Jun 19 '15 at 07:11
Unfortunately, Windows' console has very limited support for Unicode, not much you can do about that when using `System.out`, except maybe for this: http://stackoverflow.com/a/15516722/65863 – Remy Lebeau Jun 19 '15 at 15:28

score 0 · Answer 2 · answered Jun 18 '15 at 15:33

Try to set the encoding explicitly in the input stream:

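// DocumentBuilder.parse() has no overload that takes a Reader, so wrap it in an org.xml.sax.InputSource
Document doc = dBuilder.parse(new InputSource(new InputStreamReader(new FileInputStream(fXmlFile), "UTF-16")));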
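
For the scenario described in the comments on the question, where the XML arrives as a String rather than as a file, a rough sketch could look like the following. When an `InputSource` wraps a character stream, the parser reads the characters directly and disregards the `encoding="utf-16"` declaration, so no charset needs to be named. `XmlFromString` and `firstFileName` are hypothetical names, and the XML literal is a shortened version of the sample from the question.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlFromString {
    // Parses the XML text and returns the content of the first FILE_NAME element,
    // or an empty string if there is none.
    public static String firstFileName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The String is already decoded, so it is handed over as characters, not bytes.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Node node = doc.getElementsByTagName("FILE_NAME").item(0);
            return node != null ? node.getTextContent() : "";
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-16\" standalone=\"yes\"?>"
                + "<docMetadata><row><FILE_NAME>"
                + "\u0412\u044b\u043f\u0438\u0441\u043a\u0430.pdf"
                + "</FILE_NAME></row></docMetadata>";

        String name = firstFileName(xml);
        System.out.println(name.length());        // 11
        System.out.println((int) name.charAt(0)); // 1042, i.e. U+0412
    }
}
```

Printing the length and the numeric value of the first character sidesteps the console charset when checking the result.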

answered Jun 18 '15 at 15:33

polypiel

score 0 · Answer 3 · answered Jun 18 '15 at 15:58

How can I get the correct output with the default encoding (cp 1253) in eclipse project settings?

You can't. To see the correct output, the console must know the characters to display.

This code runs in a server where I don't want to change the default encoding of the virtual machine.

You could write a UTF-8/16 log file where you can see the output with cat from another console or a text editor.

            if (nNode.getNodeType() == Node.ELEMENT_NODE)
            {
                Element eElement = (Element) nNode;
                String message = "FILE_NAME: " + eElement.getElementsByTagName("FILE_NAME").item(0).getTextContent();
                System.out.println(message);
                // output FILE_NAME to logfile.txt (quick and dirty)
                OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(new File("logfile.txt")), "UTF-8");
                writer.write(message);
                writer.close();
            }

I ran this code in eclipse with ISO-8859-1 encoding in the run configuration.

Eclipse output: FILE_NAME: ???????_???????? ??????.pdf

logfile output: FILE_NAME: Выписка_Винтовые насосы.pdf

answered Jun 18 '15 at 15:58

user3141592

The Russian characters in question do not exist in ISO-8859-1, either. You have to use ISO-8859-5 instead, if not UTF-8/16. – Remy Lebeau Jun 18 '15 at 23:04
Yes, I picked ISO-8859-1 to show that the problem is the output in the console and not the reading of the xml file. – user3141592 Jun 18 '15 at 23:20
You are correct. In your example above (with Eclipse setup to use Cp1253), the eclipse console output shows ??????????.pdf but the logfile.txt shows the file name in Russian. This problem came up in a java program which is developed with Eclipse and in production environment it is executed with an .exe file in a Windows Server. In the production environment the file name comes out ???????????.pdf. If we put Eclipse aside, how can I fix this problem in production environment? – crapatzi Jun 19 '15 at 07:09
There's not much you can do on java side. You have to set the console charset in the production environment. Example on linux: `LANG=en_GB.ISO-8859-1` gives me FILE_NAME: ???????_???????? ??????.pdf. Setting the console to `LANG=en_GB.UTF-8` gives me the right output: FILE_NAME: Выписка_Винтовые насосы.pdf – user3141592 Jun 19 '15 at 11:35
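
One thing that can be tried on the Java side without changing the JVM default encoding is to wrap the output stream in a `PrintStream` with an explicit encoding. This is only a sketch: it controls which bytes the program emits, and whatever reads those bytes (console, log viewer) still has to interpret them as UTF-8, as described above. `Utf8Output` and `encodeVia` are illustrative names.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Output {
    // Returns the bytes that a PrintStream configured with the given encoding emits for text.
    public static byte[] encodeVia(String text, String encoding) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) throws Exception {
        String name = "\u0412\u044b\u043f\u0438\u0441\u043a\u0430.pdf";

        // Seven Cyrillic letters take two bytes each in UTF-8, ".pdf" takes four.
        System.out.println(encodeVia(name, "UTF-8").length); // 18

        // Same idea applied to standard output: the bytes written are UTF-8
        // no matter what the JVM default encoding is.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("FILE_NAME: " + name);
    }
}
```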

score 0 · Accepted Answer · answered Jul 16 '15 at 06:27

I was using an old dom4j library to parse the xml and that was causing the problem. Using the XML parser embedded in the JVM (1.7) solved the problem:

import java.io.File;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public XMLDoc()
    {
        try
        {
            File xmlFile = new File("C:\\my_file.xml");
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(xmlFile);
            doc.getDocumentElement().normalize();

            NodeList nList = doc.getElementsByTagName("row");
            for (int i = 0; i < nList.getLength(); i++)
            {
                Node nNode = nList.item(i);

                if (nNode.getNodeType() == Node.ELEMENT_NODE)
                {
                    Element eElement = (Element) nNode;
                    Node itemNode = eElement.getElementsByTagName("FILE_NAME").item(0);
                    String text = itemNode != null ? itemNode.getTextContent() : "";

                    // russian text is fine here
                }
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
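
Since the cause here turned out to be the parsing library rather than the file, it may help to check which parser implementation the program actually picks up at runtime. The sketch below only reports the JAXP implementation behind `DocumentBuilderFactory`; it is an assumption that this is a useful diagnostic for other setups, and it says nothing about code that calls a third-party API such as dom4j directly. `ParserInfo` is a made-up name.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

class ParserInfo {
    // Returns the class name of the DocumentBuilder that JAXP resolves at runtime.
    public static String builderClassName() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.getClass().getName();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DocumentBuilder", e);
        }
    }

    public static void main(String[] args) {
        // A class name from an unexpected package suggests that a library on the
        // classpath is overriding the parser bundled with the JVM.
        System.out.println("DocumentBuilder implementation: " + builderClassName());
    }
}
```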
