I want to parse the XML below using DOM parsing in Java.

<?xml version="1.0" encoding="utf-8" ?>
<PFA date="201303312200" type="daily">
<Person id="90061" action="chg" date="31-Mar-2013">
<Gender>Male</Gender> 
<ActiveStatus>Active</ActiveStatus> 
<Deceased>No</Deceased> 
<NameDetails>
<Name NameType="Primary Name">
<NameValue>
<TitleHonorific>Major General</TitleHonorific> 
<FirstName>Aslan</FirstName> 
<MiddleName>Ibraimis Dze</MiddleName> 
<Surname>Abashidze</Surname> 
<OriginalScriptName>مرحبا</OriginalScriptName> 
</NameValue>
</Name>
</NameDetails>
</Person></PFA>

I am parsing it with the following Java code:

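import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ParseXml {
    public static void main(String[] args) {
        String file = "PFA2_201303312200_D.xml";
        if (args.length > 0) {
            file = args[0];
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(new File(file));
            System.out.println("Encoding format : " + document.getInputEncoding());
            Element parentRoot = document.getDocumentElement();
            System.out.println("Master Node is : " + parentRoot.getTagName());
            for (int i = 0; i < parentRoot.getChildNodes().getLength(); i++) {
                Node child = parentRoot.getChildNodes().item(i);
                // Skip whitespace text nodes; casting them to Element would throw ClassCastException.
                if (!(child instanceof Element)) {
                    continue;
                }
                Element root = (Element) child;
                // ... (the rest of the loop body was not included in the question)
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}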

The file is already UTF-8 encoded, but when I run this from an IDE (Eclipse), text in other scripts is printed as ???????. How can I resolve this?

  • Is the document *actually* UTF-8-encoded, or does it just claim to be? Are you sure that the problem isn't just in terms of what `System.out.println` shows, e.g. that the correct value is in there, but can't be displayed on your console? – Jon Skeet Jun 18 '15 at 13:21
  • You may want to check Eclipse's encoding setting too, though I'm not sure it will help. http://stackoverflow.com/questions/3751791/how-to-change-default-text-file-encoding-in-eclipse – nafas Jun 18 '15 at 13:22
  • Yes, the document is definitely UTF-8 encoded. My requirement is to parse the XML, which I have done using DOM, and insert the values into the database. I put the values into a hash map and store them in beans, and finally retrieve them from there. – Krishna Bharadwaj Jun 18 '15 at 13:32
  • Thanks nafas. The file is already UTF-8 encoded, so I don't think we need to call setEncoding(). – Krishna Bharadwaj Jun 18 '15 at 13:48
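
One way to settle the question raised in the comments (is the parsed value correct, with only the console display at fault?) is to print the code points of the parsed string instead of the string itself, since that output is plain ASCII and does not depend on the console charset. The following is a minimal sketch, not the asker's code: the class name `CodePointCheck` and both helper methods are hypothetical, and the XML is a trimmed-down version of the document in the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CodePointCheck {
    // Hypothetical helper: renders each code point as U+XXXX, so the output is pure ASCII.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Hypothetical helper: parses the bytes with DOM and returns the first OriginalScriptName text.
    static String parseOriginalScriptName(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getElementsByTagName("OriginalScriptName").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes keep this source file independent of the editor's encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<PFA><Person><OriginalScriptName>"
                + "\u0645\u0631\u062D\u0628\u0627"
                + "</OriginalScriptName></Person></PFA>";
        String value = parseOriginalScriptName(xml.getBytes(StandardCharsets.UTF_8));
        // Code points in the Arabic block (U+0600 to U+06FF) mean the parser decoded the text correctly.
        System.out.println(toCodePoints(value));
    }
}
```

If the code points printed for the real file fall in the expected script's range, the data going into the database should be fine and only the console output needs attention.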

2 Answers


The problem has nothing to do with the XML itself. Java strings are UTF-16 encoded, and the Document correctly decodes the XML data from UTF-8 into UTF-16 strings. The real problem is that Eclipse is configured with a console charset that cannot represent the characters you are trying to output (Arabic, etc.), so they are replaced with ? instead. Set the console charset to UTF-8 and you should see the correct output, since conversions between UTF-8 and UTF-16 are lossless.
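
The replacement described above can be reproduced without Eclipse. This is only an illustrative sketch under stated assumptions: the class `Utf8Console` and its `encodeVia` helper are hypothetical names, and US-ASCII stands in for whatever limited charset the console happens to be using. It encodes the same text through a `PrintStream` with two different charsets, then wraps `System.out` in a UTF-8 `PrintStream`.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8Console {
    // Hypothetical helper: returns the bytes a PrintStream emits for the text in the given charset.
    static byte[] encodeVia(String text, String charsetName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, charsetName);
            out.print(text);
            out.flush();
            return buffer.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u0645\u0631\u062D\u0628\u0627"; // the Arabic text from the question

        // A charset that cannot represent the characters substitutes '?' for each one.
        byte[] ascii = encodeVia(text, "US-ASCII");
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));

        // UTF-8 can represent every character, so nothing is lost.
        byte[] utf8 = encodeVia(text, "UTF-8");
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));

        // Writing UTF-8 bytes to stdout only helps if the console also decodes them as UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

Wrapping `System.out` like this changes only what the program writes; the Eclipse console still has to be set to UTF-8 for the characters to display.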

Remy Lebeau

In Eclipse, open the Run Configurations for this program, go to the Common tab, and in the Encoding section select Other and set it to UTF-8.
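
To check whether a setting like this has taken effect, the program can report the charset the JVM is running with. This is a small sketch, assuming the run configuration's encoding reaches the launched JVM as the `file.encoding` system property; the class name `CharsetInfo` and the `describe` method are hypothetical.

```java
import java.nio.charset.Charset;

public class CharsetInfo {
    // Hypothetical helper: summarizes the charset settings visible to the running JVM.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // Run this from the same run configuration as the parser to compare settings.
        System.out.println(describe());
    }
}
```

The same value can also be passed on the command line as `-Dfile.encoding=UTF-8` when running outside Eclipse.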

barun