
Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

  • No additional meta-data is available. The byte array is literally the only available input.
  • The detection algorithm will obviously not be 100% correct. If the algorithm is correct in more than, say, 80% of the cases, that is good enough.
– knorv

7 Answers


The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    // Passing null means no listener callback is registered.
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the detector the whole buffer, then signal the end of input.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // Returns null when no charset was detected with enough confidence.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.
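
A minimal usage sketch (the class and file names below are placeholders, and guessEncoding(..) is assumed to be pasted into the same class):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GuessEncodingDemo {

    public static void main(String[] args) throws IOException {
        // "input.txt" is just a placeholder; any source of raw bytes works.
        byte[] bytes = Files.readAllBytes(Paths.get("input.txt"));
        String encoding = guessEncoding(bytes); // the method defined above
        String text = new String(bytes, encoding);
        System.out.println("Guessed " + encoding + ": " + text);
    }
}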

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.

– knorv

There is also Apache Tika, a content analysis toolkit. It can guess the MIME type, and it can guess the encoding; usually the guess is correct with very high probability.
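
Something like the following should work (a minimal sketch; Tika's CharsetDetector lives in the tika-parsers module and is a repackaged copy of the ICU detector, so treat the details as approximate):

import org.apache.tika.Tika;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public static void sniffWithTika(byte[] bytes) {
    // MIME type guess from the raw bytes.
    String mimeType = new Tika().detect(bytes);
    // Encoding guess; Tika's CharsetDetector is a repackaged ICU detector.
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    CharsetMatch match = detector.detect();
    String encoding = (match == null) ? "unknown" : match.getName();
    System.out.println(mimeType + ", " + encoding);
}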

– Thomas Mueller

Here's my favorite: https://github.com/codehaus/guessencoding

It works like this:

  • If there's a UTF-8 or UTF-16 BOM, return that encoding.
  • If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
  • If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
  • Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).

It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
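
A rough sketch of that decision chain in plain Java (not the library's actual code; step 3 leans on the JDK's strict CharsetDecoder instead of hand-rolled bit-pattern checks):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static Charset sniff(byte[] b) {
    // 1. A byte-order mark is unambiguous.
    if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
        return StandardCharsets.UTF_8;
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
        return StandardCharsets.UTF_16BE;
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
        return StandardCharsets.UTF_16LE;
    }
    // 2. No byte has the high bit set: plain ASCII.
    boolean highBit = false;
    for (byte x : b) {
        if ((x & 0x80) != 0) { highBit = true; break; }
    }
    if (!highBit) {
        return StandardCharsets.US_ASCII;
    }
    // 3. High-bit bytes present: accept UTF-8 only if the buffer decodes cleanly.
    try {
        StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(b));
        return StandardCharsets.UTF_8;
    } catch (CharacterCodingException e) {
        // 4. Otherwise fall back to the platform default (e.g. windows-1252).
        return Charset.defaultCharset();
    }
}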

– Alan Moore

Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:

http://www.joelonsoftware.com/articles/Unicode.html

Roughly speaking, all the assumed-to-be-text is decoded under every encoding imaginable, and whichever decoding best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention it just in case.
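
As a toy illustration of the idea only (the scoring below is a crude stand-in for a real letter-frequency profile, not anything IE or jchardet actually ships):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public static Charset bestByScore(byte[] bytes, Charset... candidates) {
    Charset best = null;
    double bestScore = -1;
    for (Charset cs : candidates) {
        String decoded;
        try {
            decoded = cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            continue; // the bytes are not even well-formed in this charset
        }
        if (decoded.isEmpty()) {
            continue;
        }
        // Crude stand-in for a real frequency profile:
        // the fraction of characters that are letters or whitespace.
        long plausible = decoded.chars()
                .filter(c -> Character.isLetter(c) || Character.isWhitespace(c))
                .count();
        double score = (double) plausible / decoded.length();
        if (score > bestScore) {
            bestScore = score;
            best = cs;
        }
    }
    return best;
}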

– Rooke

Check out jchardet.
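
Usage looks roughly like this, adapted from memory of the sample code that ships with jchardet, so treat the exact API details as approximate:

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsPSMDetector;

public static String guessWithJchardet(byte[] bytes) {
    nsDetector det = new nsDetector(nsPSMDetector.ALL);
    final String[] detected = new String[1];
    // The observer is called back once the detector is confident.
    det.Init(charset -> detected[0] = charset);
    boolean isAscii = det.isAscii(bytes, bytes.length);
    if (!isAscii) {
        det.DoIt(bytes, bytes.length, false);
    }
    det.DataEnd();
    if (isAscii) {
        return "ASCII";
    }
    if (detected[0] != null) {
        return detected[0];
    }
    // No confident answer: fall back to the detector's ranked guesses.
    String[] probable = det.getProbableCharsets();
    return probable.length > 0 ? probable[0] : null;
}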

– Chi

There should be libraries for this already available. A Google search turned up icu4j, or:

http://jchardet.sourceforge.net/
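
With icu4j, the relevant class is com.ibm.icu.text.CharsetDetector; a minimal sketch based on its documented API:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public static String guessWithIcu4j(byte[] bytes) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    CharsetMatch match = detector.detect(); // best guess, or null if there is none
    if (match == null) {
        return null;
    }
    // match.getConfidence() gives a 0-100 score if you want to threshold the guess.
    return match.getName();
}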

– gomesla
    I kind of know how to use Google, but the question specifically asks for "what is the best way [..]". So which is best, icu4j, jchardet or some other library? – knorv Nov 05 '09 at 05:50

Without an encoding indicator, you will never know for sure. However, you can make some intelligent guesses. See my answer to this question:

How to determine if a String contains invalid encoded characters

Use the validUTF8() method. If it returns true, treat the input as UTF-8; otherwise treat it as Latin-1.
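
The same check can be expressed directly with the JDK's strict CharsetDecoder (a sketch of the idea, not the linked answer's exact code):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public static String decodeUtf8OrLatin1(byte[] bytes) {
    try {
        // Strict decode: throws on any sequence that is not well-formed UTF-8.
        return StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    } catch (CharacterCodingException e) {
        // Not valid UTF-8, so fall back to Latin-1, which maps every byte to a char.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}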

– ZZ Coder
  • What about the cases where it is not UTF-8? – knorv Nov 05 '09 at 05:45
  • If it's not UTF-8, blindly calling it Latin-1 isn't a good idea. It would be better to use ICU, jchardet, or one of the other tools listed on this page to make an intelligent guess. – james.garriss Aug 06 '15 at 17:28