
Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

  • No additional meta-data is available. The byte array is literally the only available input.
  • The detection algorithm will obviously not be 100% correct. If the algorithm is correct in more than, say, 80% of the cases, that is good enough.
– knorv

7 Answers


The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    // Passing null means no listener callback is registered.
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the detector the whole buffer, then signal the end of input.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // Returns null when no charset was detected with enough confidence.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.
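
A minimal usage sketch (the class and file names below are placeholders, and guessEncoding(..) is assumed to be pasted into the same class):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GuessEncodingDemo {

    public static void main(String[] args) throws IOException {
        // "input.txt" is just a placeholder; any source of raw bytes works.
        byte[] bytes = Files.readAllBytes(Paths.get("input.txt"));
        String encoding = guessEncoding(bytes); // the method defined above
        String text = new String(bytes, encoding);
        System.out.println("Guessed " + encoding + ": " + text);
    }
}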

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.

– knorv

There is also Apache Tika, a content analysis toolkit. It can guess the MIME type, and it can guess the encoding; usually the guess is correct with very high probability.
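
Something like the following should work (a minimal sketch; Tika's CharsetDetector lives in the tika-parsers module and is a repackaged copy of the ICU detector, so treat the details as approximate):

import org.apache.tika.Tika;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public static void sniffWithTika(byte[] bytes) {
    // MIME type guess from the raw bytes.
    String mimeType = new Tika().detect(bytes);
    // Encoding guess; Tika's CharsetDetector is a repackaged ICU detector.
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    CharsetMatch match = detector.detect();
    String encoding = (match == null) ? "unknown" : match.getName();
    System.out.println(mimeType + ", " + encoding);
}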

– Thomas Mueller

Here's my favorite: https://github.com/codehaus/guessencoding

It works like this:

  • If there's a UTF-8 or UTF-16 BOM, return that encoding.
  • If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
  • If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
  • Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).

It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
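
A rough sketch of that decision chain in plain Java (not the library's actual code; step 3 leans on the JDK's strict CharsetDecoder instead of hand-rolled bit-pattern checks):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static Charset sniff(byte[] b) {
    // 1. A byte-order mark is unambiguous.
    if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
        return StandardCharsets.UTF_8;
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
        return StandardCharsets.UTF_16BE;
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
        return StandardCharsets.UTF_16LE;
    }
    // 2. No byte has the high bit set: plain ASCII.
    boolean highBit = false;
    for (byte x : b) {
        if ((x & 0x80) != 0) { highBit = true; break; }
    }
    if (!highBit) {
        return StandardCharsets.US_ASCII;
    }
    // 3. High-bit bytes present: accept UTF-8 only if the buffer decodes cleanly.
    try {
        StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(b));
        return StandardCharsets.UTF_8;
    } catch (CharacterCodingException e) {
        // 4. Otherwise fall back to the platform default (e.g. windows-1252).
        return Charset.defaultCharset();
    }
}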

– Alan Moore

Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:

http://www.joelonsoftware.com/articles/Unicode.html

Roughly speaking, all the assumed-to-be-text is decoded under every encoding imaginable, and whichever decoding best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention it just in case.
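
As a toy illustration of the idea only (the scoring below is a crude stand-in for a real letter-frequency profile, not anything IE or jchardet actually ships):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public static Charset bestByScore(byte[] bytes, Charset... candidates) {
    Charset best = null;
    double bestScore = -1;
    for (Charset cs : candidates) {
        String decoded;
        try {
            decoded = cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            continue; // the bytes are not even well-formed in this charset
        }
        if (decoded.isEmpty()) {
            continue;
        }
        // Crude stand-in for a real frequency profile:
        // the fraction of characters that are letters or whitespace.
        long plausible = decoded.chars()
                .filter(c -> Character.isLetter(c) || Character.isWhitespace(c))
                .count();
        double score = (double) plausible / decoded.length();
        if (score > bestScore) {
            bestScore = score;
            best = cs;
        }
    }
    return best;
}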

– Rooke

Check out jchardet.
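
Usage looks roughly like this, adapted from memory of the sample code that ships with jchardet, so treat the exact API details as approximate:

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsPSMDetector;

public static String guessWithJchardet(byte[] bytes) {
    nsDetector det = new nsDetector(nsPSMDetector.ALL);
    final String[] detected = new String[1];
    // The observer is called back once the detector is confident.
    det.Init(charset -> detected[0] = charset);
    boolean isAscii = det.isAscii(bytes, bytes.length);
    if (!isAscii) {
        det.DoIt(bytes, bytes.length, false);
    }
    det.DataEnd();
    if (isAscii) {
        return "ASCII";
    }
    if (detected[0] != null) {
        return detected[0];
    }
    // No confident answer: fall back to the detector's ranked guesses.
    String[] probable = det.getProbableCharsets();
    return probable.length > 0 ? probable[0] : null;
}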

– Chi

There should be libraries for this already available. A Google search turned up icu4j, or:

http://jchardet.sourceforge.net/
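
With icu4j, the relevant class is com.ibm.icu.text.CharsetDetector; a minimal sketch based on its documented API:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public static String guessWithIcu4j(byte[] bytes) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    CharsetMatch match = detector.detect(); // best guess, or null if there is none
    if (match == null) {
        return null;
    }
    // match.getConfidence() gives a 0-100 score if you want to threshold the guess.
    return match.getName();
}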

– gomesla
    I kind of know how to use Google, but the question specifically asks for "what is the best way [..]". So which is best, icu4j, jchardet or some other library? – knorv Nov 05 '09 at 05:50

Without an encoding indicator, you will never know for sure. However, you can make some intelligent guesses. See my answer to this question:

How to determine if a String contains invalid encoded characters

Use the validUTF8() method. If it returns true, treat the input as UTF-8; otherwise treat it as Latin-1.
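
The same check can be expressed directly with the JDK's strict CharsetDecoder (a sketch of the idea, not the linked answer's exact code):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public static String decodeUtf8OrLatin1(byte[] bytes) {
    try {
        // Strict decode: throws on any sequence that is not well-formed UTF-8.
        return StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    } catch (CharacterCodingException e) {
        // Not valid UTF-8, so fall back to Latin-1, which maps every byte to a char.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}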

– ZZ Coder
  • What about the cases where it is not UTF-8? – knorv Nov 05 '09 at 05:45
  • If it's not UTF-8, blindly calling it Latin-1 isn't a good idea. It would be better to use ICU, jchardet, or one of the other tools listed on this page to make an intelligent guess. – james.garriss Aug 06 '15 at 17:28