
Our application receives files from our users, and we must validate that each file is in one of the encodings we support (UTF-8, Shift-JIS, or EUC-JP). Once a file is validated, we also need to save it in our system along with its encoding as metadata.
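
To make "validated" concrete, the kind of check involved can be sketched with plain java.nio (the helper name below is only illustrative):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Illustrative helper: returns the first supported charset that decodes the bytes cleanly, or null.
static String findValidEncoding(byte[] bytes) {
    for (String name : new String[] {"UTF-8", "Shift_JIS", "EUC-JP"}) {
        CharsetDecoder decoder = Charset.forName(name).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));  // throws on invalid byte sequences
            return name;                             // store this as the file's encoding metadata
        } catch (CharacterCodingException e) {
            // not valid in this encoding, try the next one
        }
    }
    return null;  // the file is in none of the supported encodings
}

A check like this only says whether the bytes are legal in a given encoding; the same bytes can be legal in more than one of them, which is why we use a detector rather than just trying each charset in turn.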

Currently, we're using JCharDet (a Java port of Mozilla's character detector), but there are some Shift-JIS characters that it fails to detect as valid Shift-JIS.

Any ideas what else we can use?

Zong
Franz See
    possible duplicate of [Java : How to determine the correct charset encoding of a stream](http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream) – Fabian Steeg Sep 10 '10 at 12:22
  • How does the application receive files? If it is through HTTP, this should be stored in the mime headers. – Peter DeWeese Sep 10 '10 at 12:26
  • @Peter: no, certainly not. The mime header only represents the encoding of the HTTP request body, not the file's original encoding. – BalusC Sep 10 '10 at 14:35

2 Answers


ICU4J's CharsetDetector will help you.

import com.ibm.icu.text.CharsetDetector;
import java.io.BufferedInputStream;
import java.io.FileInputStream;

// CharsetDetector needs a stream that supports mark/reset, hence the BufferedInputStream wrapper.
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(path));
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
String charsetName = cd.detect().getName();

By the way, which characters caused the error, and what kind of error was it? ICU4J may have the same problem, depending on the character and the error.
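
If you also want to see how confident the detector is, detectAll() returns every candidate match with a confidence score; a small sketch reusing the cd instance from above:

import com.ibm.icu.text.CharsetMatch;

// Inspect every candidate charset and its confidence score (0-100).
for (CharsetMatch match : cd.detectAll()) {
    System.out.println(match.getName() + " : " + match.getConfidence());
}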

SATO Yusuke

Apache Tika is a content analysis toolkit that is mainly used for determining file types, as opposed to encoding schemes, but it does return content-encoding information for text file types. I don't know whether its algorithms are as advanced as JCharDet's, but it might be worth a try...
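
If you go the Tika route, one way to get at the detected encoding (as far as I understand Tika's API) is to parse the file with AutoDetectParser and read the "Content-Encoding" value it puts into the metadata for text files; a rough sketch, error handling omitted:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Parse the file; for text files Tika records the charset it detected in the metadata.
Metadata metadata = new Metadata();
InputStream in = new FileInputStream(path);
new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata, new ParseContext());
in.close();
String encoding = metadata.get("Content-Encoding");   // e.g. "Shift_JIS"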

gutch