
Welcome all

I'm developing a Java app that calls a PHP script over the internet, which returns an XML response.

The response contains the word "Próximo", but when I parse the XML nodes and read the value into a String variable, I receive the word like this: "Pr&oacute;ximo".

I'm sure the problem is that my Java app uses a different encoding than the PHP script. So I suppose I must set the encoding to the same one the PHP XML uses, UTF-8.

This is the code I'm using to get the XML file from the PHP script.

What should I change in this code to set the encoding to UTF-8? (Note that I'm not using a BufferedReader; I'm using an InputStream.)

        InputStream in = null;
        String url = "http://www.myurl.com";
        try {
            URL formattedUrl = new URL(url);
            URLConnection connection = formattedUrl.openConnection();
            HttpURLConnection httpConnection = (HttpURLConnection) connection;
            httpConnection.setAllowUserInteraction(false);
            httpConnection.setInstanceFollowRedirects(true);
            httpConnection.setRequestMethod("GET");
            httpConnection.connect();
            // "in" stays null if the server does not answer with 200 OK
            if (httpConnection.getResponseCode() == HttpURLConnection.HTTP_OK)
                in = httpConnection.getInputStream();

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(in);
            doc.getDocumentElement().normalize();
            NodeList myNodes = doc.getElementsByTagName("myNode");
        } catch (Exception e) {
            e.printStackTrace();
        }
Pableras84
  • Are you sure it's an encoding issue? Have you tested your PHP content with a web browser? I think the source XML contains the escaped character! – Amir Pashazadeh Jul 22 '12 at 19:12
  • You asked another question here: http://stackoverflow.com/questions/11494069/problems-parsing-spanish-characters-a-e-i-o-u-from-xml-response The answer there from @kgb is what you should be looking at. This is not a problem of encoding. It seems the content of the XML is some HTML data, and that data was escaped. You need to unescape it. The following link shows how HTML escapes some special characters in foreign languages: http://www.thesauruslex.com/typo/eng/enghtml.htm –  Jul 22 '12 at 19:17

1 Answer


When you get your InputStream, read byte[]s from it. When you create your Strings, pass in the charset name "UTF-8". Example:

byte[] buffer = new byte[contentLength];
int bytesRead = inputStream.read(buffer);
String page = new String(buffer, 0, bytesRead, "UTF-8");

Note that you'll probably want to give your buffer a sane size (like 1024) and call inputStream.read(buffer) repeatedly until it returns -1.
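
A minimal sketch of that read loop follows. The class name StreamToString and the helper readUtf8 are made up for illustration, and the in-memory stream in main only stands in for the HTTP response. The bytes are collected first and decoded once at the end, because decoding each 1024-byte chunk separately could split a multi-byte UTF-8 character across two chunks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToString {

    // Hypothetical helper: drains the stream in 1024-byte chunks,
    // then decodes all collected bytes as UTF-8 in one step.
    public static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int bytesRead;
        // read() returns -1 once the end of the stream is reached
        while ((bytesRead = in.read(buffer)) != -1) {
            collected.write(buffer, 0, bytesRead);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HTTP response body; \u00f3 is the letter o with acute accent
        byte[] sample = "Pr\u00f3ximo".getBytes(StandardCharsets.UTF_8);
        System.out.println(readUtf8(new ByteArrayInputStream(sample)));
    }
}
```

This still holds the whole response in memory, so it only suits responses of moderate size.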


@Amir Pashazadeh

Yes, you can also use an InputStreamReader, and try changing the parse() line to:

Document doc = db.parse(new InputSource(new InputStreamReader(in, "UTF-8")));
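
To try that line without a network call, here is a small self-contained sketch. The class Utf8XmlParse, the helper firstNodeText, and the sample XML are assumptions for illustration; only the parse() call mirrors the line above. It assumes at least one element with the given tag name exists.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Utf8XmlParse {

    // Hypothetical helper: parses the stream as UTF-8 text and returns the
    // text of the first element named tagName (assumes one exists).
    public static String firstNodeText(InputStream in, String tagName) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new InputSource(new InputStreamReader(in, StandardCharsets.UTF_8)));
        doc.getDocumentElement().normalize();
        return doc.getElementsByTagName(tagName).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document standing in for the PHP response
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<root><myNode>Pr\u00f3ximo</myNode></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstNodeText(in, "myNode"));
    }
}
```

If the feed itself carries an escaped entity (for example `Pr&amp;oacute;ximo` in the XML source), the parser returns the literal text `Pr&oacute;ximo` whatever charset is passed, so this change alone would not alter the result in that case.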
Jon Lin
  • What about InputStreamReader? – Amir Pashazadeh Jul 22 '12 at 19:11
  • Is there a solution that does not require reading the whole `InputStream` upfront into memory? Sometimes it can be quite big... – Tomasz Nurkiewicz Jul 22 '12 at 19:11
  • i can't fix my code with your solution... please, can you edit my code with your solution so i can test it with my php xml file? – Pableras84 Jul 22 '12 at 19:12
  • @TomaszNurkiewicz Yeah, you can write it to a file and then pass the file to `db.parse()` – Jon Lin Jul 22 '12 at 19:15
  • As others have pointed out, this is not an encoding issue. The HTML entity `&oacute;` is being sent as part of the feed, and the only way to get a readable character from this is to translate the entity. – Bobulous Jul 22 '12 at 21:49
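
Translating the entity, as the comments suggest, could look roughly like the sketch below. The class EntityUnescaper and its small lookup table are assumptions for illustration and cover only a handful of Spanish characters; a library routine such as Apache Commons Lang's StringEscapeUtils.unescapeHtml4 handles the full HTML entity set, if adding a dependency is acceptable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityUnescaper {

    // Deliberately small, illustrative table; not the full HTML entity list
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("aacute", "\u00e1");
        ENTITIES.put("eacute", "\u00e9");
        ENTITIES.put("iacute", "\u00ed");
        ENTITIES.put("oacute", "\u00f3");
        ENTITIES.put("uacute", "\u00fa");
        ENTITIES.put("ntilde", "\u00f1");
        ENTITIES.put("uuml", "\u00fc");
    }

    // Matches named entities such as &oacute;
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Replaces known entities with their characters; unknown ones are left as-is
    public static String unescape(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(0);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("Pr&oacute;ximo"));
    }
}
```

The string returned by the DOM parser (for example from getTextContent()) would be passed through unescape() before being displayed.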