2

I want to read an XML file from the internet. You can find it here.
The problem is that it is encoded in UTF-8 and I need to save it to a file so that I can parse it later. I have already read a lot of topics about this, and here is what I came up with:

BufferedReader in;
String readLine;
try
{
    in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    BufferedWriter out = new BufferedWriter(new FileWriter(file));

    while ((readLine = in.readLine()) != null)
        out.write(readLine+"\n");

    out.close();
}

catch (UnsupportedEncodingException e)
{
    e.printStackTrace();
}

catch (IOException e)
{
    e.printStackTrace();
}

This code works until it reaches this line: <title>Chérie FM</title>
When I debug, I get this: <title>Ch�rie FM</title>

Obviously, there is something I fail to understand, but it seems to me that I followed the code I saw on several websites.

Vinay
Thibault
  • It is encoded in ISO-8859-1, not UTF-8. ``. I have also verified that the actual bytes sent are ISO-8859-1 as well. – Esailija Aug 01 '12 at 12:27
  • @Esailija: That's not what I see when I open the link shown. I see `` - although the contents do indeed appear to be ISO-8859-1. Weird. – Jon Skeet Aug 01 '12 at 12:30
  • @JonSkeet How do you see that? perso.mcom.fr/thibault/channelList.xml does not have that. It has `` with ISO-8859-1 bytes. – Esailija Aug 01 '12 at 12:31
  • @Esailija: Not for me, either with Chrome or via wget. Perhaps it's changing the declaration automatically based on some client header, but not changing the actual content encoding? – Jon Skeet Aug 01 '12 at 12:32
  • @JonSkeet they must be sniffing and changing it then. My browser sends this header: `Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3`. Maybe they just read the first thing of that and use it in the xml encoding attribute? who knows :D – Esailija Aug 01 '12 at 12:32
  • To all: I actually did not notice that it was not UTF-8. I was so sure of it that I did not check. So I quickly changed the declaration to UTF-8, and I got the error "error on line 57 at column 11: Encoding error". When I changed my code to ISO-8859-1 instead, I got the right result! – Thibault Aug 01 '12 at 12:40
  • @Thibault the `encoding` xml attribute does not change a file's encoding in any way. It merely makes a claim to whoever is reading the file that the file is encoded using the given encoding in the attribute. The actual bytes of the file must be encoded using the claimed encoding attribute as well for everything to work smoothly. The actual bytes of your file are encoded in ISO-8859-1 so that's why you get the error when the attribute lies that it is UTF-8 – Esailija Aug 01 '12 at 12:55
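
To make the point in the comment above concrete, here is a minimal sketch of what happens when ISO-8859-1 bytes are decoded as UTF-8. The class and method names (`EncodingMismatchDemo`, `decode`) are made up for illustration, and the sample line is the `<title>` element from the question; this is not the actual server response.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Decode raw bytes with the given charset; malformed input becomes U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u00E9" is 'é'; ISO-8859-1 stores it as the single byte 0xE9.
        byte[] latin1Bytes = "<title>Ch\u00E9rie FM</title>".getBytes(StandardCharsets.ISO_8859_1);

        // 0xE9 followed by 'r' is not a valid UTF-8 sequence, so it is replaced.
        String wrong = decode(latin1Bytes, StandardCharsets.UTF_8);
        // Decoding with the charset the bytes were written in restores the text.
        String right = decode(latin1Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(wrong.indexOf('\uFFFD') >= 0); // true
        System.out.println(right.equals("<title>Ch\u00E9rie FM</title>")); // true
    }
}
```

The replacement character U+FFFD is the `�` shown in the question's debug output.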

2 Answers

8

This file is not encoded as UTF-8; it is ISO-8859-1.

By changing your code to:

BufferedReader in;
String readLine;
try
{
    in = new BufferedReader(new InputStreamReader(url.openStream(), "ISO-8859-1"));
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter( new FileOutputStream(file) , "UTF-8"));

    while ((readLine = in.readLine()) != null)
        out.write(readLine+"\n");
    out.flush();
    out.close();
}

catch (UnsupportedEncodingException e)
{
    e.printStackTrace();
}

catch (IOException e)
{
    e.printStackTrace();
}

You should have the expected result.
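
For what it's worth, the same read-as-ISO-8859-1, write-as-UTF-8 idea can be sketched with try-with-resources, so that both streams are closed even when an exception is thrown. This is only a sketch under assumptions: `TranscodeSketch` and `transcodeToUtf8` are hypothetical names, and the `main` method feeds in-memory bytes instead of `url.openStream()` so that it runs without network access.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TranscodeSketch {
    // Read text from 'source' using 'sourceCharset' and write it to 'target' as UTF-8.
    static void transcodeToUtf8(InputStream source, Charset sourceCharset, Path target)
            throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(source, sourceCharset));
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.write('\n'); // line endings are normalized to "\n", as in the answer
            }
        } // both reader and writer are closed here, which also flushes the writer
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): one line encoded as ISO-8859-1.
        byte[] latin1 = "<title>Ch\u00E9rie FM</title>\n".getBytes(StandardCharsets.ISO_8859_1);
        Path target = Files.createTempFile("channelList", ".xml");

        transcodeToUtf8(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1, target);

        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

With the real feed, the call would presumably look like `transcodeToUtf8(url.openStream(), StandardCharsets.ISO_8859_1, file.toPath())`. Using the `StandardCharsets` constants also avoids having to catch `UnsupportedEncodingException`.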

Maurício Linhares
-1

If you need to write a file in a given encoding, use FileOutputStream instead.

in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
FileOutputStream out = new FileOutputStream(file);

while ((readLine = in.readLine()) != null)
    out.write((readLine+"\n").getBytes("UTF-8"));

out.close();
Angelo Fuchs