Read unicode characters in XML in Java/Android

Question

I am trying to read XML output that contains some Unicode characters. I can't read the complete string inside a tag, only the first character.

Here is my XML output:

    <item>
        <id>1</id>
        <name>&#x0DBD;&#x0DDC;&#x0DBD;&#x0DCA;</name>
        <cost>155</cost>
        <description>&#x0DBD;&#x0DDC;</description>
    </item>

This is the Java code I use to parse the XML string:

    public Document getDomElement(String xml) {
        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();

            InputSource is = new InputSource();
            is.setEncoding("UTF-16");
            is.setCharacterStream(new StringReader(xml));
            doc = db.parse(is);
        } catch (ParserConfigurationException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (SAXException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (IOException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        }
        // return DOM
        return doc;
    }

When I use normal English characters it gives the complete string.

What happens when you try to parse the non-English chars? Are the strings incorrect, or does it fail? — helios, Sep 21 '12 at 07:45
It doesn't fail. It just reads only the first character. In this example it only outputs ල, not ලොල් — Chrishan, Sep 21 '12 at 08:00
Oh, OK. Two questions then: does `valueOfTheContainedText.length()` return 1 or 4? And if you print the XML before parsing, does it look like the above? — helios, Sep 21 '12 at 08:08

score 1 · Answer 1 · answered Sep 21 '12 at 07:53

I've tried your code and there's no problem. When I inspect the nodes with non-English chars, they exist and have the correct number of chars. They're not printable here because the font I'm using doesn't have those glyphs, but value.codePointAt(i) returns the correct code point.

    NodeList list = doc.getDocumentElement().getChildNodes();
    for (int i=0; i<list.getLength(); i++)
    {
        String value = list.item(i).getTextContent();
        for (int j=0; j<value.length(); j++)
            System.out.print(" " + value.codePointAt(j));
        System.out.println();
    }

outputs:

 49
 3517 3548 3517 3530
 49 53 53
 3517 3548

which correspond to the decimal representation of your codepoints.
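To compare the decimal values above with the hex references in the XML (`&#x0DBD;` and so on), something like the following should work. It is only a minimal sketch: the class name `CodePointHex` and the method `toHex` are made up for illustration and are not part of the code in the question.

```java
// Hypothetical helper class, named here for illustration only.
final class CodePointHex {

    // Formats every code point of the string as U+XXXX, separated by spaces.
    static String toHex(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by 1 or 2 chars so surrogate pairs are counted once.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The four characters of the <name> element from the question.
        System.out.println(toHex("\u0DBD\u0DDC\u0DBD\u0DCA"));
        // prints: U+0DBD U+0DDC U+0DBD U+0DCA

        // 3517 decimal is the same value as 0DBD hex.
        System.out.println(Integer.toHexString(3517)); // prints: dbd
    }
}
```

Stepping with `Character.charCount` only matters for characters outside the Basic Multilingual Plane; the Sinhala characters here each fit in one `char`.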

I've created the xml string by hand. You already have it in memory right?

This helped me a lot, but with this method I can't read node by node. I will post my code here. Thanks a lot. — Chrishan, Sep 21 '12 at 10:17

score 0 · Answer 2 · edited May 23 '17 at 11:55

By Unicode people usually mean UTF-8, but you are using UTF-16, which is bad.
XML declares its own encoding in its header, so you should not need to override it.

answered Sep 21 '12 at 07:41 · mauhiz

I thought about that, but he's reading from a String in memory, so setting the char encoding on the InputSource has no effect. It also makes sense that his in-memory XML string has no encoding header, because it's already decoded. – helios Sep 21 '12 at 07:43
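A minimal sketch of the point in the comment above, assuming the XML is already a Java String: the parser is handed characters through a `StringReader`, so no `setEncoding` call is involved, and the numeric character references are resolved by the parser itself. The class name `StringXmlDemo` and the method `textOf` are hypothetical, and wrapping the checked exceptions in `IllegalStateException` is just a simplification for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not taken from the question.
final class StringXmlDemo {

    // Parses XML held in a String and returns the text of the first
    // element with the given tag name. The InputSource wraps a character
    // stream, so there are no bytes for an encoding setting to decode.
    static String textOf(String xml, String tag) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Simplification for the sketch: surface any failure as unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<item><name>&#x0DBD;&#x0DDC;&#x0DBD;&#x0DCA;</name></item>";
        String name = textOf(xml, "name");
        System.out.println(name.length()); // prints: 4
    }
}
```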

score 0 · Accepted Answer · answered Sep 21 '12 at 10:20

This is the code I used to solve my problem.

    NodeList idlist = doc.getElementsByTagName(KEY_ID);
    NodeList namelist = doc.getElementsByTagName(KEY_NAME);
    NodeList costlist = doc.getElementsByTagName(KEY_COST);
    NodeList desclist = doc.getElementsByTagName(KEY_DESC);
    for (int i=0; i<idlist.getLength(); i++)
    {
        Item item = new Item();
        item.setCost(costlist.item(i).getTextContent());
        item.setDescription(desclist.item(i).getTextContent());
        item.setName(namelist.item(i).getTextContent());
        itemarray.add(item);
    }
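The parallel lists above line up only if every item contains all four tags. A possible variant, sketched below, walks each `<item>` element and reads its children from there. This is an illustration only: the tag names are taken from the XML in the question (the actual `KEY_*` values are not shown), the `<items>` root element in `main` is an assumption, and `ItemReader`, its nested `Item` class, and `childText` are made-up stand-ins.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical reader class, not the code from the accepted answer.
final class ItemReader {

    // Hypothetical stand-in for the Item class used in the answer.
    static final class Item {
        String id;
        String name;
        String cost;
        String description;
    }

    // Text of the first descendant with the given tag, or "" if it is missing.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : "";
    }

    // Builds one Item per <item> element, so a missing tag in one item
    // cannot shift the values of the items that follow it.
    static List<Item> readItems(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList itemNodes = doc.getElementsByTagName("item");
            List<Item> result = new ArrayList<>();
            for (int i = 0; i < itemNodes.getLength(); i++) {
                Element element = (Element) itemNodes.item(i);
                Item item = new Item();
                item.id = childText(element, "id");
                item.name = childText(element, "name");
                item.cost = childText(element, "cost");
                item.description = childText(element, "description");
                result.add(item);
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Simplification for the sketch: surface any failure as unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items>"
                + "<item><id>1</id><name>&#x0DBD;&#x0DDC;&#x0DBD;&#x0DCA;</name>"
                + "<cost>155</cost><description>&#x0DBD;&#x0DDC;</description></item>"
                + "</items>";
        Item first = readItems(xml).get(0);
        System.out.println(first.name.length()); // prints: 4
    }
}
```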
