
In Java, is there a way to detect whether a file is ANSI or UTF-8? The problem I am having is that if someone creates a CSV file in Excel it's UTF-8, but if they create it using Notepad it's ANSI.

I am wondering if I can detect the type of file and then handle it accordingly.

Thanks.

user1158745
  • Does this help? https://code.google.com/p/juniversalchardet/ – Martin Pfeffer Jan 13 '15 at 19:59
  • Check: http://stackoverflow.com/questions/3759356/what-is-the-most-accurate-encoding-detector – Leandro Carracedo Jan 13 '15 at 20:00
  • can you provide some code rather than just links? – user1158745 Jan 13 '15 at 20:02
  • You may be able to check for the UTF-8 BOM, if Excel includes it (I don't have a copy here to check). You could open as binary, read the first three bytes and check for `0xEF,0xBB,0xBF`, or optimistically open as "Cp1252" ("ANSI") and if you see `ï»¿` at the start, reopen it as UTF-8 (see the sketch after these comments). – CupawnTae Jan 13 '15 at 20:18
  • @user1158745 Those links seem to be quite useful and provide code examples. If you want, you are allowed to post an answer to your own question. – NiematojakTomasz Jan 13 '15 at 20:18
  • @CupawnTae this approach seems more like what I am looking for. It would be nice if Java had a function to detect file encoding types. – user1158745 Jan 13 '15 at 20:27
  • The problem is there's no definitive way to do it. For example, a text file that says "Hello World" would be encoded exactly the same in ASCII, ANSI/Cp1252 and UTF-8 (if the Byte Order Mark isn't present in the UTF-8 file, which it often isn't), and it would therefore actually be impossible to detect the intended character encoding from that file. When the UTF-8 BOM is present, you can be reasonably sure it's UTF-8 but it's not a 100% guarantee (it could be a random binary file with those bytes at the start for example) – CupawnTae Jan 13 '15 at 20:38
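
For reference, here is a minimal sketch of the byte-level check described in the comment above: open the file as raw bytes, read the first three, and compare them against 0xEF, 0xBB, 0xBF. The class and method names here are only illustrative.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class BomCheck {

    // Returns true if the file begins with the UTF-8 Byte Order Mark (0xEF 0xBB 0xBF).
    public static boolean startsWithUtf8Bom(File file) throws IOException {
        try (FileInputStream in = new FileInputStream(file)) {
            byte[] bom = new byte[3];
            int read = in.read(bom); // may be fewer than 3 for very short files
            return read == 3
                    && bom[0] == (byte) 0xEF
                    && bom[1] == (byte) 0xBB
                    && bom[2] == (byte) 0xBF;
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        // Pick the charset to read the CSV with based on the BOM check.
        String charset = startsWithUtf8Bom(file) ? "UTF-8" : "Cp1252";
        System.out.println("Treating " + file + " as " + charset);
    }
}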

1 Answer


You could try something like this. It relies on Excel including a Byte Order Mark (BOM), which a quick search suggests it does (although I can't verify it), and on the fact that Java treats the BOM as the particular "character" \uFEFF.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

FileInputStream fis = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));

String line = br.readLine();
if (line != null && line.startsWith("\uFEFF")) {
    // it's UTF-8, throw away the BOM character and continue
    line = line.substring(1);
} else {
    // no BOM, so assume it's not UTF-8 and reopen as ANSI
    br.close(); // also closes fis
    fis = new FileInputStream(file); // reopen from the start
    br = new BufferedReader(new InputStreamReader(fis, "Cp1252"));
    line = br.readLine();
}

// now line contains the first line, and br.readLine() will get the next

Some more information on the UTF-8 Byte Order Mark and detection of encoding is available at http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
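
If the file has no BOM (which, as the comments point out, is often the case for UTF-8), the juniversalchardet library linked in the comments can make a heuristic guess instead. Here is a rough sketch based on that library's documented usage; getDetectedCharset() may return null if it can't decide, and I haven't verified it against Excel output:

import java.io.FileInputStream;
import org.mozilla.universalchardet.UniversalDetector;

// Feed the raw bytes of the file to the detector and ask what it guessed.
byte[] buf = new byte[4096];
FileInputStream fis = new FileInputStream(file);
UniversalDetector detector = new UniversalDetector(null);

int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
detector.dataEnd();
fis.close();

String encoding = detector.getDetectedCharset(); // e.g. "UTF-8" or "WINDOWS-1252", or null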

CupawnTae