
This is my code:

    StringBuffer fileData = new StringBuffer(1000);
    BufferedReader reader = new BufferedReader(new FileReader(file));
    char[] buf = new char[5000];
    int numRead=0;
    while((numRead=reader.read(buf)) != -1){
        String readData = String.valueOf(buf, 0, numRead);
        fileData.append(readData);
        buf = new char[1024];
    }
    reader.close();
    return fileData.toString();

When I run it on Windows, everything is fine.

But when I run it on UNIX, at the beginning of the string I see:

    ï»¿

What can be the issue?

yuris

2 Answers


The hexdump of the given char sequence is probably ef bb bf. I said probably, as I had to guess your display encoding.

If that is correct, you are trying to read a UTF-8 encoded file with a BOM prefix as ISO-8859-X. That would be coherent with the fact that you didn't see those chars when opening the file with vi/vim. Most if not all UTF-8 aware text editors know how to deal with the BOM.

From Java, you have to skip it manually (I don't know why it works on Windows, though). Another option is to save your text file as UTF-8 without a BOM.
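A minimal sketch of that manual skip (the class and method names are mine, not from the OP): read with an explicit UTF-8 decoder, then drop a leading U+FEFF if one is present.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class BomSkippingReader {
    /** Reads a whole file as UTF-8, stripping a leading BOM if there is one. */
    public static String readUtf8SkippingBom(String path) throws IOException {
        StringBuilder fileData = new StringBuilder(1000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            char[] buf = new char[5000];
            int numRead;
            while ((numRead = reader.read(buf)) != -1) {
                fileData.append(buf, 0, numRead);
            }
        }
        // A file starting with EF BB BF decodes to a single leading U+FEFF char here.
        if (fileData.length() > 0 && fileData.charAt(0) == '\uFEFF') {
            fileData.deleteCharAt(0);
        }
        return fileData.toString();
    }
}
```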

This has already been discussed. See for example:



As this is not really clear, I've made the following experiment: I've created two files, both UTF-8 encoded and containing the string "L'élève va à l'école." The only difference between those two test files is that one has a BOM prefix.

Then, based on the code given by the OP and a suggestion by Thomas Mueller, I wrote a very simple Java app to read those files using various encodings. Here is the code:

import java.io.*;

public class EncodingTest {
    public static String read(String file, String encoding) throws IOException {
        StringBuffer fileData = new StringBuffer(1000);

        /* Only difference with OP code */
        /* I use *explicit* encoding while reading the file */
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), encoding)
                );

        char[] buf = new char[5000];
        int numRead=0;
        while((numRead=reader.read(buf)) != -1){
            String readData = String.valueOf(buf, 0, numRead);
            fileData.append(readData);
            buf = new char[1024];
        }
        reader.close();
        return fileData.toString();     
    }

    public static void main(String[] args) throws IOException {
        System.out.print(read("UTF-8-BOM-FILE", "UTF-8"));
        System.out.print(read("UTF-8-FILE", "UTF-8"));
        System.out.print(read("UTF-8-BOM-FILE", "ISO-8859-15"));
        System.out.print(read("UTF-8-FILE", "ISO-8859-15"));
    }
}

When I run this on my Linux system, whose console encoding is UTF-8, I obtain the following results:

$ java -cp bin EncodingTest
L'élève va à l'école.
L'élève va à l'école.
ï»¿L'Ã©lÃšve va Ã  l'Ã©cole.
L'Ã©lÃšve va Ã  l'Ã©cole.

Notice how the third line starts with the exact same sequence as given by the OP. That is what you get when reading a BOM-prefixed UTF-8 encoded file as ISO-8859-15.

Surprisingly enough, the first two lines seem to be the same, as if Java had magically removed the BOM. I guess this is what is happening for the OP on Windows.

But, a closer inspection showed that:

$ java -cp bin EncodingTest | hexdump -C
00000000  ef bb bf 4c 27 c3 a9 6c  c3 a8 76 65 20 76 61 20  |...L'..l..ve va |
00000010  c3 a0 20 6c 27 c3 a9 63  6f 6c 65 2e 0a 4c 27 c3  |.. l'..cole..L'.|
00000020  a9 6c c3 a8 76 65 20 76  61 20 c3 a0 20 6c 27 c3  |.l..ve va .. l'.|
00000030  a9 63 6f 6c 65 2e 0a c3  af c2 bb c2 bf 4c 27 c3  |.cole........L'.|
00000040  83 c2 a9 6c c3 83 c5 a1  76 65 20 76 61 20 c3 83  |...l....ve va ..|
00000050  c2 a0 20 6c 27 c3 83 c2  a9 63 6f 6c 65 2e 0a 4c  |.. l'....cole..L|
00000060  27 c3 83 c2 a9 6c c3 83  c5 a1 76 65 20 76 61 20  |'....l....ve va |
00000070  c3 83 c2 a0 20 6c 27 c3  83 c2 a9 63 6f 6c 65 2e  |.... l'....cole.|
00000080  0a                                                |.|
00000081

Please notice the first three bytes: the BOM was sent to the output -- but my console somehow discarded it. However, from the Java program's perspective, those bytes were present -- and I should probably have taken care of them manually.


So, what is the moral of all this? The OP really has two issues: a BOM-prefixed, UTF-8 encoded file, and the attempt to read it as ISO-8859-X.

Yuris, in order to fix that, you have to explicitly use the correct encoding in your Java program, and either discard the first three bytes yourself or change your data file to remove the BOM.

Sylvain Leroux
  • Hm, if it's a BOM issue (that's possible), how come it only shows up in Linux but not in Windows? Assuming it's the exact same file... – Thomas Mueller Jul 23 '14 at 06:35
  • @ThomasMueller It is quite easy to know for sure if it is a BOM issue: the OP has only to hexdump the first few bytes of the "problematic" text file. Regarding Windows, I admit this is rather strange. Here again a simple hex dump of the file on the windows host might give some clues on what's going on. – Sylvain Leroux Jul 23 '14 at 08:52
  • @ThomasMueller To be more precise, I guess the OP has _two_ issues. A BOM prefixed UTF8 encoded file _and_ the attempt to read it as ISO-8859-X encoding. On windows, maybe the OP reads its BOM prefixed UTF8 file as UTF8, so the 3 first bytes are converted to the single unicode character `FEFF`. Maybe that character is silently dropped by some call on the road or at display? – Sylvain Leroux Jul 23 '14 at 09:27
  • yes that makes sense. I will remove my answer, as your answer addresses both the BOM and the encoding issue. – Thomas Mueller Jul 23 '14 at 12:05
byte[] content = Files.readAllBytes(file.toPath());
String s = new String(content, StandardCharsets.UTF_8);
s = s.replaceFirst("^\uFEFF", ""); // Remove the BOM, if present
return s;

The BOM is a byte order mark, optionally used as the first character -- a zero-width no-break space -- to mark a file as, say, UTF-8 (or UTF-16LE, UTF-16BE).
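For the record, you can check the actual byte sequences yourself by encoding U+FEFF under each charset (a standalone snippet, not part of the answer's code):

```java
import java.nio.charset.Charset;

public class BomBytes {
    public static void main(String[] args) {
        // Encode the BOM character U+FEFF under each charset and print its bytes.
        for (String cs : new String[]{"UTF-8", "UTF-16LE", "UTF-16BE"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : "\uFEFF".getBytes(Charset.forName(cs))) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println(cs + ": " + hex.toString().trim());
        }
    }
}
```

This prints `ef bb bf` for UTF-8, `ff fe` for UTF-16LE, and `fe ff` for UTF-16BE -- the first one being exactly the sequence discussed above.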

Its main use on Windows seems to be letting Notepad tell such text apart from ANSI-encoded text.

FileReader is a utility class that cannot be given an encoding; it always uses the platform default.
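A sketch of the usual workaround, assuming Java 7+: use Files.newBufferedReader (or wrap a FileInputStream in an InputStreamReader) so the charset is explicit instead of the platform default:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetRead {
    /** Reads a whole text file with an explicitly chosen charset. */
    public static String read(Path path) throws IOException {
        // Unlike FileReader, Files.newBufferedReader lets you name the charset.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }
}
```

Note that this strict UTF-8 decoder will throw on malformed input, which is often preferable to silently producing mojibake.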

It might be that your file has already suffered a wrong encoding conversion. Maybe UTF-8 text got pasted into an ANSI single-byte encoded text, or whatever.

Joop Eggen