
I have a file which is encoded as iso-8859-1, and contains characters such as ô .

I am reading this file with java code, something like:

File in = new File("myfile.csv");
InputStream fr = new FileInputStream(in);
byte[] buffer = new byte[4096];
while (true) {
    int byteCount = fr.read(buffer, 0, buffer.length);
    if (byteCount <= 0) {
        break;
    }

    String s = new String(buffer, 0, byteCount, "ISO-8859-1");
    System.out.println(s);
}

However the ô character is always garbled, usually printing as a ? .

I have read around the subject (and learnt a little on the way), but still cannot get this working.

Interestingly, this works on my local PC (XP) but not on my Linux box.

I have checked that my JDK supports the required charsets (they are standard, so this is no surprise) using:

System.out.println(java.nio.charset.Charset.availableCharsets());
Danielson
Joel
  • I should add that I am able to see the characters of the original file correctly using my Linux terminal if I simply cat the contents of the file – Joel Jan 31 '09 at 11:45
  • What character encoding is being used by your terminal? – McDowell Jan 31 '09 at 11:59
  • Interestingly - if I add the runtime java property "-Dfile.encoding=UTF16" it works as expected, although I do not see why this should matter - and I do not see it as a solution, but more of a hack. It does not work with the property set to UTF8. – Joel Jan 31 '09 at 12:55

5 Answers


I suspect that either your file isn't actually encoded as ISO-8859-1, or System.out doesn't know how to print the character.

To check the first, examine the relevant byte in the file. To check the second, examine the relevant character in the string, printing it out with:

 System.out.println((int) s.charAt(index));

In both cases the result should be 244 decimal; 0xf4 hex.
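Putting the two checks together, a minimal sketch might look like the following (the in-memory byte array is a stand-in for the bytes read from the file; a real check would read them with FileInputStream as in the question):

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) {
        // Stand-in for the file contents: "ô" followed by a newline,
        // as encoded in ISO-8859-1.
        byte[] fileBytes = { (byte) 0xF4, 0x0A };

        // Step 1: examine the raw byte. For an ISO-8859-1 file it should be 0xf4.
        System.out.printf("byte: 0x%02x%n", fileBytes[0]);

        // Step 2: decode and examine the resulting character. Should be 244.
        String s = new String(fileBytes, StandardCharsets.ISO_8859_1);
        System.out.println("char: " + (int) s.charAt(0));
    }
}
```

If both values come out as 244 / 0xf4, the file and the decoding step are fine, and the problem lies further along the output path.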

See my article on Unicode debugging for general advice (the code presented is in C#, but it's easy to convert to Java, and the principles are the same).

In general, by the way, I'd wrap the stream with an InputStreamReader with the right encoding - it's easier than creating new strings "by hand". I realise this may just be demo code though.

EDIT: Here's a really easy way to prove whether or not the console will work:

 System.out.println("Here's the character: \u00f4");
Jon Skeet
  • have used linux file tool to test the type of the file: file --mime FranceJ2.csv FranceJ2.csv: text/plain; charset=iso-8859-1 and also confirmed that I can read it correctly, in say vi but i will follow your suggestions. – Joel Jan 31 '09 at 11:04
  • Don't trust tools that are trying to detect character encodings automatically. They're always just based on heuristics, and have to be. They don't know what text your file is really meant to contain. – Jon Skeet Jan 31 '09 at 11:06
  • A hexdump of the file yields: 0000000 0df4 000a (any suggestions!?) – Joel Jan 31 '09 at 11:10
  • Like Jon suggests in his article, verify the data at each step. If you don't run the code in debugger, you can dump hex bytes to console to make sure you have really data which you expect. (Esp. if it is this small) – Peter Štibraný Jan 31 '09 at 11:16
  • As suggested the decimal value of the character is 244. This is mysterious since it suggests that the garbling occurs during the sys.out call, or in the terminal itself. I know that it is not the terminal since i can cat the file and see its content no problem. Hmmm – Joel Jan 31 '09 at 12:33
  • @Joel: Any luck with System.console().printf(s) then? – Zach Scrivena Jan 31 '09 at 13:11
  • @zach - no i'm afraid it yields the same result. Oddly enough though I have noticed that setting -Dfile.encoding to UTF16 causes it to work, but not if set to UTF8. I do not understand why this would be, and it appears more of a hack than a fix. – Joel Jan 31 '09 at 13:14
  • @Joel: What if you redirect program output to a file, and then cat it? – Zach Scrivena Jan 31 '09 at 13:46
  • @Zach - same, the chars come out as ?. Also, if I put the debugger on it, they also show as a ?. I am most perplexed. – Joel Jan 31 '09 at 13:57
  • Please refer to my Answer below for the code I used to get this working. The suggestion in this post that the problem was due to the System.out call was correct. Thanks for all your help. – Joel Jan 31 '09 at 15:14

Parsing the file as fixed-size blocks of bytes is not good --- what if some character has a byte representation that straddles across two blocks? Use an InputStreamReader with the appropriate character encoding instead:

 BufferedReader br = new BufferedReader(
         new InputStreamReader(
         new FileInputStream("myfile.csv"), "ISO-8859-1"));

 char[] buffer = new char[4096]; // character (not byte) buffer 

 while (true)
 {
      int charCount = br.read(buffer, 0, buffer.length);

      if (charCount == -1) break; // reached end-of-stream 

      String s = String.valueOf(buffer, 0, charCount);
      // alternatively, we can append to a StringBuilder

      System.out.println(s);
 }

Btw, remember to check that the Unicode character can indeed be displayed correctly. You could also redirect the program output to a file and then compare it with the original file.

As Jon Skeet suggests, the problem may also be console-related. Try System.console().printf(s) to see if there is a difference.
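One caveat worth adding (an assumption, not part of the original answer): System.console() returns null when the JVM is not attached to an interactive terminal, e.g. when output is redirected, so a null check is needed before calling printf:

```java
import java.io.Console;

public class ConsoleCheck {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            // No interactive terminal: output redirected, or run from an IDE.
            System.out.println("no console attached");
        } else {
            // Console encodes using the platform/terminal charset.
            console.printf("Here's the character: \u00f4%n");
        }
    }
}
```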

Zach Scrivena

@Joel - your own answer confirms that the problem is a difference between the default encoding on your operating system (UTF-8, the one Java has picked up) and the encoding your terminal is using (ISO-8859-1).

Consider this code:

public static void main(String[] args) throws IOException {
    byte[] data = { (byte) 0xF4 };
    String decoded = new String(data, "ISO-8859-1");
    if (!"\u00f4".equals(decoded)) {
        throw new IllegalStateException();
    }

    // write default charset
    System.out.println(Charset.defaultCharset());

    // dump bytes to stdout
    System.out.write(data);

    // will encode to default charset when converting to bytes
    System.out.println(decoded);
}

By default, my Ubuntu (8.04) terminal uses the UTF-8 encoding. With this encoding, this is printed:

UTF-8
�ô

If I switch the terminal's encoding to ISO 8859-1, this is printed:

UTF-8
ôÃ´

In both cases, the same bytes are being emitted by the Java program:

5554 462d 380a f4c3 b40a

The only difference is in how the terminal is interpreting the bytes it receives. In ISO 8859-1, ô is encoded as 0xF4. In UTF-8, ô is encoded as 0xC3B4. The other characters are common to both encodings.
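The two byte representations named above can be verified directly by encoding the character with each charset (a small sketch; -12 and -61, -76 are the signed-byte views of 0xF4 and 0xC3 0xB4):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCompare {
    public static void main(String[] args) {
        String s = "\u00f4"; // ô

        // ISO-8859-1 encodes ô as the single byte 0xF4.
        byte[] iso = s.getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8 encodes ô as the two bytes 0xC3 0xB4.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.toString(iso));  // [-12]
        System.out.println(Arrays.toString(utf8)); // [-61, -76]
    }
}
```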

McDowell
  • I am certainly missing something here - what is the `5554 462d 380a f4c3 b40a` dump ? Certainly not the `System.out.write(data)` call ? – Mr_and_Mrs_D Apr 10 '13 at 11:16
  • @Mr_and_Mrs_D These are the bytes the JRE writes to the device (STDOUT) with all three calls to `System.out`. The `0A` bytes mark the newlines written by `println`. _There was an answer written by the question author, since deleted, but I don't think being able to read it adds much._ – McDowell Apr 10 '13 at 15:18
  • Thanks for following up - I understood there was an answer by the author since deleted - can not read it - thanks :) – Mr_and_Mrs_D Apr 10 '13 at 21:22

If you can, try to run your program in a debugger to see what's inside your 's' string after it is created. It is possible that it has the correct content, but that the output is garbled after the System.out.println(s) call. In that case, there is probably a mismatch between what Java thinks the encoding of your output is and the character encoding of your terminal/console on Linux.
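If such a mismatch is the cause, one way to work around it (a sketch, not part of the original answer) is to wrap stdout in a PrintStream with an explicitly chosen charset instead of relying on the JVM default:

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitOut {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Choose the charset to match what the terminal actually uses.
        // ISO-8859-1 here is an assumption; use UTF-8 for a UTF-8 terminal.
        PrintStream out = new PrintStream(System.out, true, "ISO-8859-1");
        out.println("Here's the character: \u00f4");
    }
}
```

This bypasses the default charset Java picked up from the environment, which is what the -Dfile.encoding experiment in the comments was effectively changing.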

Peter Štibraný

Basically, if it works on your local XP PC but not on Linux, and you are parsing the exact same file (i.e. you transferred it in a binary fashion between the boxes), then it probably has something to do with the System.out.println call. I don't know how you verify the output, but if you do it by connecting with a remote shell from the XP box, then there is the character set of the shell (and the client) to consider.

Additionally, what Zach Scrivena suggests is also true - you cannot assume that you can create strings from chunks of data in that way - either use an InputStreamReader or read the complete data into an array first (obviously not going to work for a large file). However, since it does seem to work on XP, then I would venture that this is probably not your problem in this specific case.
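The "read the complete data first" option can be sketched as follows. On a modern JDK (an assumption; java.nio.file postdates this question) the whole file can be read and decoded in one step, which avoids any chance of splitting a character's bytes across buffer boundaries:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWhole {
    public static void main(String[] args) throws IOException {
        // File name taken from the question; adjust as needed.
        byte[] bytes = Files.readAllBytes(Paths.get("myfile.csv"));
        // Decode all bytes at once with the file's known charset.
        String content = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.print(content);
    }
}
```

As noted, this is only sensible for files that comfortably fit in memory; for large files, stick with an InputStreamReader.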

Eek