0

I want to parse the following XML structure:

<?xml version="1.0" encoding="utf-8"?>
<documents>
  <document>
    <element name="title">
      <value><![CDATA[Personnel changes: Müller]]></value>
    </element>
  </document>
</documents>

To parse this <element name="..."> structure I use XPath in the following way:

XPath xPath = XPathFactory.newInstance().newXPath();

String currentString = (String) xPath.evaluate("/documents/document/element[@name='title']/value",pCurrentXMLAsDOM, XPathConstants.STRING);

The parsing itself works fine, but there are problems with German umlauts and special characters such as "Ü" or "ß". When I print currentString, the result is:

Personnel changes: MÃ¼ller

But I want the String exactly as it appears in the XML:

Personnel changes: Müller

Just to add: I can't change the content of the XML file. I have to parse it as I receive it, so every String must be decoded correctly.

durron597
Metalhead89
  • What encoding is your JVM running with? Can you set it to UTF-8 using -Dfile.encoding? – Brian Agnew Aug 08 '12 at 09:36
  • Thanks for this comment. How can I change this programmatically? I think -Dfile.encoding is a command-line argument, isn't it? If I have to change it, I would like to do it inside my code. – Metalhead89 Aug 08 '12 at 09:39
  • `System.out.println(System.getProperty("file.encoding"));` returns "Cp1252" – Metalhead89 Aug 08 '12 at 09:42
  • Where does `pCurrentXMLAsDOM` come from? Is it read from a file? How? –  Aug 08 '12 at 10:10
  • First I read the XML with a BufferedReader (FileReader) and store it in a String. Then I strip invalid characters (because the XML contains binary data which I can't parse in a normal way) and convert the String to a Document object. All this works fine and I can read every XML element exactly. There are just these problems with the character set. – Metalhead89 Aug 08 '12 at 10:15

3 Answers

2

Sounds like an encoding problem. The XML is UTF-8 encoded Unicode which you seem to print encoded as ISO-8859-1. Check the encoding settings of your Java source.

Edit: See "Setting the default Java character encoding?" for how to set file.encoding.
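To illustrate the diagnosis, here is a minimal sketch of how UTF-8 bytes turn into garbage when decoded with a single-byte charset. The class and method names (`MojibakeDemo`, `decodeUtf8BytesAs`) are made up for this illustration, and ISO-8859-1 is used as the wrong charset; Cp1252 maps these two particular bytes to the same characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes `text` as UTF-8 and decodes those bytes with the given (possibly wrong) charset.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // "Müller", written with an escape so the source file encoding does not matter.
        String original = "M\u00fcller";

        // Wrong: the two UTF-8 bytes of "ü" (0xC3 0xBC) are read as two separate characters.
        String garbled = decodeUtf8BytesAs(original, StandardCharsets.ISO_8859_1);
        // Right: decode with the charset the bytes were actually written in.
        String correct = decodeUtf8BytesAs(original, StandardCharsets.UTF_8);

        System.out.println(garbled.equals("M\u00c3\u00bcller")); // true: "ü" became "Ã¼"
        System.out.println(correct.equals(original));            // true
    }
}
```

A FileReader behaves like the wrong branch whenever the platform default charset differs from the file's actual encoding, because it always decodes with the default.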

Community
  • I am sorry, but I do not see a good solution in this link. They say you could set the OS environment variable `JAVA_TOOL_OPTIONS` to `-Dfile.encoding=UTF8`, but this is no solution for me. I can't change this variable because my application has to run on every PC without changing system variables. – Metalhead89 Aug 08 '12 at 10:08
  • Also, just for fun, I tried setting -Dfile.encoding="UTF-8" as a JVM argument. It changed the default character set to UTF-8, but the String is still wrong: `Personnel changes: Müller` – Metalhead89 Aug 08 '12 at 10:12
1

I found a good and fast solution now:

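import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.IOUtils; // Apache Commons IO

public static String convertXMLToString(File pCurrentXML) {
    // Decode the bytes explicitly as UTF-8 instead of relying on the
    // platform default (Cp1252 here). try-with-resources closes the stream.
    try (InputStream is = new FileInputStream(pCurrentXML)) {
        return IOUtils.toString(is, "UTF-8");
    } catch (IOException e) {
        // Also covers FileNotFoundException; returns null on failure.
        e.printStackTrace();
        return null;
    }
}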

Afterwards I convert the String to a DOM object:

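import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

static Document convertStringToXMLDocumentObject(String string) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    try {
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The String is already decoded, so the parser reads characters, not bytes.
        return builder.parse(new InputSource(new StringReader(string)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // Returns null on failure.
        e.printStackTrace();
        return null;
    }
}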

Then I can parse the DOM with XPath, for example, and all element values are decoded correctly from UTF-8. Demonstration:

currentString = (String) xPath.evaluate("/documents/document/element[@name='title']/value",pCurrentXMLAsDOM, XPathConstants.STRING);
System.out.println(currentString);

Output:

Personnel changes: Müller

:)
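If Apache Commons IO is not available, the same explicit decoding can probably be done with the JDK alone. The following is only a sketch under that assumption: `ReadXmlAsUtf8` and `readUtf8` are names invented for this example, and it requires Java 7 or later for `java.nio.file.Files`.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ReadXmlAsUtf8 {

    // Reads the whole file and decodes its bytes explicitly as UTF-8,
    // independent of the JVM's file.encoding default.
    static String readUtf8(File file) throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file, created only for this demonstration.
        File sample = File.createTempFile("sample", ".xml");
        sample.deleteOnExit();
        String xml = "<value><![CDATA[Personnel changes: M\u00fcller]]></value>";
        Files.write(sample.toPath(), xml.getBytes(StandardCharsets.UTF_8));

        System.out.println(readUtf8(sample).equals(xml)); // true
    }
}
```

The resulting String can be passed to convertStringToXMLDocumentObject in the same way as the output of IOUtils.toString.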

Metalhead89
0

If you know the file is UTF-8 encoded, try something like:

    FileInputStream fis = new FileInputStream("yourfile.xml");
    InputStreamReader in = new InputStreamReader(fis, "UTF-8");

    InputSource pCurrentXMLAsDOM = new InputSource(in);
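An InputSource is not tied to SAX parsing: DocumentBuilder.parse also accepts one, so the DOM and XPath approach from the question can stay as it is, and the CDATA section is read as ordinary text. A rough sketch of this, where the class name `InputSourceDomDemo`, the helper `readTitle`, and the in-memory sample XML are assumptions made for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InputSourceDomDemo {

    // Parses UTF-8 XML bytes into a DOM via an InputSource and returns the title value.
    static String readTitle(InputStream xmlBytes) throws Exception {
        // The reader fixes the decoding to UTF-8 before the parser sees any characters.
        InputSource source = new InputSource(new InputStreamReader(xmlBytes, StandardCharsets.UTF_8));
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate("/documents/document/element[@name='title']/value", dom);
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for the XML file from the question.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<documents><document><element name=\"title\">"
                + "<value><![CDATA[Personnel changes: M\u00fcller]]></value>"
                + "</element></document></documents>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));

        System.out.println(readTitle(in).equals("Personnel changes: M\u00fcller")); // true
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by the FileInputStream shown above.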
mabroukb
  • That's a good idea, but InputSource is from the SAX parser, right? And I am not sure how to parse my XML file with SAX, because of this: ` <![CDATA[Personnel changes: Müller]]> ` – Metalhead89 Aug 09 '12 at 06:37