
After some research, I have discovered that there are a few encoding-detection projects in the Java world that can be used when getEncoding() in InputStreamReader does not help (a short sketch below illustrates why):

  1. juniversalchardet
  2. jchardet
  3. cpdetector
  4. ICU4J

However, I really do not know which of them is the best. Can anyone with hands-on experience tell me which one works best in Java?
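For context, a minimal sketch (not from the original question) of why getEncoding() cannot detect the encoding of the data: it only reports the charset the reader was constructed with, or the platform default.

    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;

    public class GetEncodingDemo {
        public static void main(String[] args) throws Exception {
            // Bytes written in ISO-8859-1, but the reader is opened without a charset.
            byte[] bytes = "café".getBytes("ISO-8859-1");
            InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(bytes));
            // Prints the platform default (e.g. "UTF8"), not the encoding of the bytes.
            System.out.println(reader.getEncoding());
        }
    }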

– Winston Chen

3 Answers


I've checked juniversalchardet and ICU4J on some CSV files, and the results were inconsistent between the two; overall, juniversalchardet did better:

  • UTF-8: both detected it.
  • Windows-1255: juniversalchardet detected it once there were enough Hebrew letters, while ICU4J still thought it was ISO-8859-1. With even more Hebrew letters, ICU4J detected it as ISO-8859-8, which is the other Hebrew encoding (so the text was still OK).
  • Shift_JIS (Japanese): juniversalchardet detected it, while ICU4J thought it was ISO-8859-2.
  • ISO-8859-1: detected by ICU4J, not supported by juniversalchardet.

So you should consider which encodings you are most likely to deal with. In the end I chose ICU4J.

Notice that ICU4J is still maintained.

Also note that you may want to try ICU4J first and, if it returns null because detection failed, fall back to juniversalchardet, or the other way around.
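A minimal sketch of that fallback idea, assuming a single buffered read gives the detectors enough data (the helper method and buffer size are illustrative, not part of either library):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import org.mozilla.universalchardet.UniversalDetector;

    public class EncodingGuesser {

        /** Tries ICU4J first, then falls back to juniversalchardet; may return null. */
        public static String detect(InputStream in) throws IOException {
            byte[] buf = new byte[8192];
            int len = in.read(buf);
            if (len <= 0) {
                return null;
            }

            // ICU4J: statistical detection over the buffered bytes.
            CharsetDetector icuDetector = new CharsetDetector();
            icuDetector.setText(Arrays.copyOf(buf, len));
            CharsetMatch match = icuDetector.detect();
            if (match != null) {
                return match.getName();
            }

            // Fallback: juniversalchardet (Mozilla's universalchardet port).
            UniversalDetector detector = new UniversalDetector(null);
            detector.handleData(buf, 0, len);
            detector.dataEnd();
            return detector.getDetectedCharset(); // null if nothing was detected
        }
    }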

Apache Tika's AutoDetectReader does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
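If Tika is already on your classpath, that chain can be used directly; a rough usage sketch (the file-path argument is just for the example):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.nio.charset.Charset;

    import org.apache.tika.detect.AutoDetectReader;

    public class TikaDetectExample {
        public static void main(String[] args) throws Exception {
            try (AutoDetectReader reader =
                    new AutoDetectReader(new BufferedInputStream(new FileInputStream(args[0])))) {
                Charset charset = reader.getCharset(); // result of the detector chain
                System.out.println("Detected: " + charset);
                // The reader itself is a java.io.Reader already decoded with that charset.
            }
        }
    }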

– yishaiz

I found an answer online:

http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

It says something valuable:

The strength of a character encoding detector lies in whether or not its focus is on statistical analysis or HTML META and XML prolog discovery. If you are processing HTML files that have META, use cpdetector. Otherwise, your best bet is either monq.stuff.EncodingDetector or com.sun.syndication.io.XmlReader.

So that is why I am using cpdetector now. I will update this post with the results.
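For reference, cpdetector is usually driven through its detector proxy, roughly like this (a sketch following cpdetector's documented usage; the package name info.monitorenter.cpdetector.io is the one used by recent releases and may differ in older ones):

    import java.io.File;
    import java.nio.charset.Charset;

    import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
    import info.monitorenter.cpdetector.io.JChardetFacade;
    import info.monitorenter.cpdetector.io.ParsingDetector;

    public class CpdetectorExample {
        public static void main(String[] args) throws Exception {
            CodepageDetectorProxy proxy = CodepageDetectorProxy.getInstance();
            proxy.add(new ParsingDetector(false));   // reads HTML META / XML prolog declarations
            proxy.add(JChardetFacade.getInstance()); // statistical fallback based on jchardet
            Charset charset = proxy.detectCodepage(new File(args[0]).toURI().toURL());
            System.out.println("Detected: " + charset);
        }
    }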

– Winston Chen
  • Do you only care about files that already are tagged with the charset via XML or META? That test is very, very suspect (so much so that I ran it myself). The test files it uses are not real content, but they are code charts. I.e., they are not "text in encoding X" but "text in English with a list of the code points in encoding X". However, all test files are tagged with the encoding. A comparison should be done, but not with these test files. – Steven R. Loomis Oct 01 '10 at 22:45
  • Further testing: I ran the test case in that blog against the same detectors (latest versions) on untagged data. ONLY icu detected: euc-jp, iso-2022-jp, koi8-r, iso-2022-cn iso-2022-kr.... Only ICU and Mozilla jchardet detected: shift-jis, gb18030, big5... I used samples from http://source.icu-project.org/repos/icu/icu/trunk/source/extra/uconv/samples/ and the utf-8 directory (some converted from files there into the target codepage). – Steven R. Loomis Oct 01 '10 at 23:37

I've personally used jchardet in our project (juniversalchardet wasn't available back then) just to check if a stream was UTF-8 or not.

It was easier to integrate with our application than the alternatives and yielded great results.
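In case it is useful, jchardet's detection loop (Mozilla's nsDetector) typically looks roughly like this; the file-path argument and the way the result is printed are just for the example:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.mozilla.intl.chardet.nsDetector;
    import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
    import org.mozilla.intl.chardet.nsPSMDetector;

    public class JchardetExample {
        public static void main(String[] args) throws Exception {
            nsDetector det = new nsDetector(nsPSMDetector.ALL);
            final String[] detected = new String[1];
            // The observer is notified once the detector is confident about a charset.
            det.Init(new nsICharsetDetectionObserver() {
                public void Notify(String charset) {
                    detected[0] = charset;
                }
            });

            boolean isAscii = true;
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                byte[] buf = new byte[1024];
                int len;
                boolean done = false;
                while (!done && (len = in.read(buf)) != -1) {
                    if (isAscii) {
                        isAscii = det.isAscii(buf, len);
                    }
                    if (!isAscii) {
                        done = det.DoIt(buf, len, false); // feed bytes until the detector is confident
                    }
                }
            }
            det.DataEnd(); // may trigger the Notify callback with the final guess

            System.out.println(detected[0] != null ? "Detected: " + detected[0]
                    : isAscii ? "Plain ASCII" : "Not detected");
        }
    }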

– fglez