
In Java, is there a way to detect whether a file is ANSI or UTF-8? The problem I am having is that if someone creates a CSV file in Excel it's UTF-8, but if they create it using Notepad it's ANSI.

I am wondering if I can detect the type of file and then handle it accordingly.

Thanks.

user1158745
  • Does this help? https://code.google.com/p/juniversalchardet/ – Martin Pfeffer Jan 13 '15 at 19:59
  • Check: http://stackoverflow.com/questions/3759356/what-is-the-most-accurate-encoding-detector – Leandro Carracedo Jan 13 '15 at 20:00
  • can you provide some code rather than just links? – user1158745 Jan 13 '15 at 20:02
  • You may be able to check for the UTF-8 BOM, if Excel includes it (I don't have a copy here to check). You could open as binary, read the first three bytes and check for `0xEF,0xBB,0xBF`, or optimistically open as "Cp1252" ("ANSI") and if you see `ï»¿` at the start, reopen it as UTF-8 (see the sketch after these comments). – CupawnTae Jan 13 '15 at 20:18
  • @user1158745 Those links seem to be quite useful and provide code examples. If you want, you are allowed to post an answer to your own question. – NiematojakTomasz Jan 13 '15 at 20:18
  • @CupawnTae this approach seems more like what I am looking for. It would be nice if Java had a function to detect file encoding types. – user1158745 Jan 13 '15 at 20:27
  • The problem is there's no definitive way to do it. For example, a text file that says "Hello World" would be encoded exactly the same in ASCII, ANSI/Cp1252 and UTF-8 (if the Byte Order Mark isn't present in the UTF-8 file, which it often isn't), and it would therefore actually be impossible to detect the intended character encoding from that file. When the UTF-8 BOM is present, you can be reasonably sure it's UTF-8 but it's not a 100% guarantee (it could be a random binary file with those bytes at the start for example) – CupawnTae Jan 13 '15 at 20:38
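
For reference, here is a minimal sketch of the byte-level check described in the comment above: open the file as raw bytes, read the first three, and compare them against 0xEF, 0xBB, 0xBF. The class and method names here are only illustrative.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class BomCheck {

    // Returns true if the file begins with the UTF-8 Byte Order Mark (0xEF 0xBB 0xBF).
    public static boolean startsWithUtf8Bom(File file) throws IOException {
        try (FileInputStream in = new FileInputStream(file)) {
            byte[] bom = new byte[3];
            int read = in.read(bom); // may be fewer than 3 for very short files
            return read == 3
                    && bom[0] == (byte) 0xEF
                    && bom[1] == (byte) 0xBB
                    && bom[2] == (byte) 0xBF;
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        // Pick the charset to read the CSV with based on the BOM check.
        String charset = startsWithUtf8Bom(file) ? "UTF-8" : "Cp1252";
        System.out.println("Treating " + file + " as " + charset);
    }
}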

1 Answer


You could try something like this. It relies on Excel including a Byte Order Mark (BOM), which a quick search suggests it does (although I can't verify it), and on the fact that Java treats the BOM as the particular "character" \uFEFF.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

FileInputStream fis = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));

String line = br.readLine();
if (line != null && line.startsWith("\uFEFF")) {
    // it's UTF-8, throw away the BOM character and continue
    line = line.substring(1);
} else {
    // no BOM, so assume it's not UTF-8 and reopen as ANSI
    br.close(); // also closes fis
    fis = new FileInputStream(file); // reopen from the start
    br = new BufferedReader(new InputStreamReader(fis, "Cp1252"));
    line = br.readLine();
}

// now line contains the first line, and br.readLine() will get the next

Some more information on the UTF-8 Byte Order Mark and detection of encoding is available at http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
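
If the file has no BOM (which, as the comments point out, is often the case for UTF-8), the juniversalchardet library linked in the comments can make a heuristic guess instead. Here is a rough sketch based on that library's documented usage; getDetectedCharset() may return null if it can't decide, and I haven't verified it against Excel output:

import java.io.FileInputStream;
import org.mozilla.universalchardet.UniversalDetector;

// Feed the raw bytes of the file to the detector and ask what it guessed.
byte[] buf = new byte[4096];
FileInputStream fis = new FileInputStream(file);
UniversalDetector detector = new UniversalDetector(null);

int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
detector.dataEnd();
fis.close();

String encoding = detector.getDetectedCharset(); // e.g. "UTF-8" or "WINDOWS-1252", or null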

CupawnTae