
Our application receives files from our users, and we must validate that each file is in one of the encodings we support (UTF-8, Shift-JIS, or EUC-JP). Once a file is validated, we also need to save it in our system along with its encoding as metadata.
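
To make "validated" concrete, the kind of check involved can be sketched with plain java.nio (the helper name below is only illustrative):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Illustrative helper: returns the first supported charset that decodes the bytes cleanly, or null.
static String findValidEncoding(byte[] bytes) {
    for (String name : new String[] {"UTF-8", "Shift_JIS", "EUC-JP"}) {
        CharsetDecoder decoder = Charset.forName(name).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));  // throws on invalid byte sequences
            return name;                             // store this as the file's encoding metadata
        } catch (CharacterCodingException e) {
            // not valid in this encoding, try the next one
        }
    }
    return null;  // the file is in none of the supported encodings
}

A check like this only says whether the bytes are legal in a given encoding; the same bytes can be legal in more than one of them, which is why we use a detector rather than just trying each charset in turn.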

Currently, we're using JCharDet (a Java port of Mozilla's character detector), but there are some Shift-JIS characters that it fails to detect as valid Shift-JIS.

Any ideas what else we can use?

Zong
Franz See
    possible duplicate of [Java : How to determine the correct charset encoding of a stream](http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream) – Fabian Steeg Sep 10 '10 at 12:22
  • How does the application receive files? If it is through HTTP, this should be stored in the mime headers. – Peter DeWeese Sep 10 '10 at 12:26
  • @Peter: no, certainly not. The mime header only represents the encoding of the HTTP request body, not the file's original encoding. – BalusC Sep 10 '10 at 14:35

2 Answers


ICU4J's CharsetDetector will help you.

import com.ibm.icu.text.CharsetDetector;
import java.io.BufferedInputStream;
import java.io.FileInputStream;

// CharsetDetector needs a stream that supports mark/reset, hence the BufferedInputStream wrapper.
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(path));
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
String charsetName = cd.detect().getName();

By the way, which characters caused the error, and what kind of error was it? ICU4J may have the same problem, depending on the character and the error.
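
If you also want to see how confident the detector is, detectAll() returns every candidate match with a confidence score; a small sketch reusing the cd instance from above:

import com.ibm.icu.text.CharsetMatch;

// Inspect every candidate charset and its confidence score (0-100).
for (CharsetMatch match : cd.detectAll()) {
    System.out.println(match.getName() + " : " + match.getConfidence());
}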

SATO Yusuke

Apache Tika is a content analysis toolkit that is mainly used for determining file types, as opposed to encoding schemes, but it does return content-encoding information for text file types. I don't know whether its algorithms are as advanced as JCharDet's, but it might be worth a try...
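
If you go the Tika route, one way to get at the detected encoding (as far as I understand Tika's API) is to parse the file with AutoDetectParser and read the "Content-Encoding" value it puts into the metadata for text files; a rough sketch, error handling omitted:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Parse the file; for text files Tika records the charset it detected in the metadata.
Metadata metadata = new Metadata();
InputStream in = new FileInputStream(path);
new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata, new ParseContext());
in.close();
String encoding = metadata.get("Content-Encoding");   // e.g. "Shift_JIS"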

gutch