Java : How to determine the correct charset encoding of a stream

Question

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct charset encoding of an input stream/file?

I have tried using the following:

File in =  new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());

But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.

Eduard is right, "You cannot determine the encoding of a arbitrary byte stream". All other proposals give you ways (and libraries) to do best guessing. But in the end they are still guesses. — Mihai Nita, Dec 15 '11 at 12:13
`Reader.getEncoding` returns the encoding the reader was set up to use, which in your case is the default encoding. — Karol S, Sep 27 '14 at 18:44
`System.getProperty("file.encoding")` returns a string, e.g. `FileInputStream fis = new FileInputStream(path); String encoding = System.getProperty("fis.encoding");` — Sathvik, Nov 10 '20 at 15:50

score 108 · Answer 1 · edited Mar 10 '14 at 19:51

You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between byte values and their representation. So every encoding "could" be the right one.

The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
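To make this concrete, here is a minimal sketch using only the JDK. The class name GetEncodingDemo and its helper method are made up for illustration. It suggests that the value reported by getEncoding() depends only on the charset handed to the reader, never on the bytes in the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of any library.
final class GetEncodingDemo {

    /**
     * Builds a reader over the given bytes with an explicit charset and
     * returns what getEncoding() reports (the JDK's historical charset name).
     */
    static String reportedEncoding(byte[] bytes, Charset charset) {
        // A ByteArrayInputStream holds no system resources, so the reader is
        // deliberately left unclosed; getEncoding() returns null after close().
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(bytes), charset);
        return reader.getEncoding();
    }
}
```

Calling reportedEncoding with UTF-8 bytes and with ISO-8859-1 bytes, but the same charset argument, should give the same answer both times, which is why the question's code only ever printed the platform default.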

Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.

Anyway, you could try to guess an encoding on your own if you have to. Every language has a characteristic frequency for each character. In English the char e appears very often but ê appears very seldom. In an ISO-8859-1 stream there are usually no 0x00 bytes, but a UTF-16 stream has a lot of them.
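The zero-byte observation can be turned into a rough heuristic. The sketch below is only an illustration: the class name ZeroByteHeuristic and the 30% / 5% thresholds are assumptions chosen for the example, not values taken from any library, and the result is still a guess.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper illustrating the "count the 0x00 bytes" idea.
final class ZeroByteHeuristic {

    /**
     * Guesses UTF-16BE or UTF-16LE from where zero bytes fall in the sample.
     * Mostly-Latin text in UTF-16BE has zeros at even offsets (00 41 = 'A'),
     * in UTF-16LE at odd offsets (41 00 = 'A').
     */
    static Optional<Charset> guessUtf16(byte[] sample) {
        int units = sample.length / 2;   // number of 16-bit code units
        if (units == 0) {
            return Optional.empty();
        }
        int evenZeros = 0;
        int oddZeros = 0;
        for (int i = 0; i < sample.length; i++) {
            if (sample[i] == 0) {
                if (i % 2 == 0) {
                    evenZeros++;
                } else {
                    oddZeros++;
                }
            }
        }
        double evenRatio = (double) evenZeros / units;
        double oddRatio = (double) oddZeros / units;
        // Thresholds are arbitrary assumptions for this sketch.
        if (evenRatio > 0.3 && oddRatio < 0.05) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (oddRatio > 0.3 && evenRatio < 0.05) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();   // no strong evidence either way
    }
}
```

An empty result does not mean the data is ISO-8859-1; it only means this particular signal is absent.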

Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.

edited Mar 10 '14 at 19:51 by Agostino · answered Jan 31 '09 at 15:44 by Eduard Wirch

This doesn't really answer the question. The op should probably be using http://docs.codehaus.org/display/GUESSENC/Home or http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html or http://jchardet.sourceforge.net/ – Christoffer Hammarström Dec 15 '10 at 10:23
So how does my editor, Notepad++, know how to open the file and show me the right characters? – mjs Dec 20 '11 at 14:51
@Hamidam it is by luck that it shows you the right characters. When it guesses wrongly (and it often does), there is an option (Menu >> Encoding) that allows you to change the encoding. – Pacerier Jan 17 '12 at 09:51
@Eduard: "So every encoding "could" be the right." not quite right. Many text encodings have several patterns that are invalid, which are a flag that the text is _probably_ not that encoding. In fact, given the first two bytes of a file, only 38% of the combinations are valid UTF8. The odds of the first 5 codepoints being valid UTF8 by chance are less than .77%. Likewise, UTF16BE and LE are usually easily identified by the large number of zero bytes and where they are. – Mooing Duck Dec 06 '12 at 18:32
It would be nice to be able to get at least as accurate a method as Notepad++ or just plain Notepad. Can nobody tell us what that is? – Emperor Eto Aug 07 '20 at 14:27

score 79 · Accepted Answer · edited Jun 29 '21 at 15:06

I have used this library, similar to jchardet for detecting encoding in Java: https://github.com/albfernandez/juniversalchardet

edited Jun 29 '21 at 15:06 by Kalle Richter · answered Jan 19 '11 at 13:44 by Luciano Fiandesio

I found that this was more accurate: http://jchardet.sourceforge.net/ (I was testing on Western European language documents encoded in ISO 8859-1, windows-1252, utf-8) – Joel Apr 06 '11 at 09:42
This juniversalchardet does not work. It delivers UTF-8 most of the time, even if the file is 100% windows-1212 encoded. – Brain Sep 11 '16 at 18:28
It does not detect Eastern European windows-1250 – Bernhard Döbler Aug 01 '18 at 09:48
I tried the following code snippet for detection on the file from "https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt" but got null as the detected character set. UniversalDetector ud = new UniversalDetector(null); byte[] bytes = FileUtils.readFileToByteArray(new File(file)); ud.handleData(bytes, 0, bytes.length); ud.dataEnd(); detectedCharset = ud.getDetectedCharset(); – Rohit Verma Sep 14 '18 at 12:00
Juniversalchardet doesn't support ISO-8859-1, one of the most common charsets. – Thomas Jun 15 '21 at 11:23

score 40 · Answer 3 · edited Dec 17 '11 at 00:41

Check out ICU4J (http://site.icu-project.org/), which includes a library for detecting the charset of an input stream. It can be as simple as this:

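import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// setText(InputStream) needs a stream that supports mark/reset, hence the BufferedInputStream
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

if (cm != null) {
   reader = cm.getReader();
   charset = cm.getName();
} else {
   // UnsupportedCharsetException has no no-arg constructor, so pass a description
   throw new UnsupportedCharsetException("no charset detected");
}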

edited Dec 17 '11 at 00:41 by Maxim Veksler · answered Oct 25 '10 at 10:11 by user345883

I tried it, but it fails badly: I made two text files in Eclipse, both containing "öäüß", one set to ISO encoding and one to UTF-8, and both were detected as UTF-8! So I tried a file saved somewhere on my hard disk (Windows), and this one was detected correctly ("windows-1252"). Then I created two new files on the hard disk, one edited with Editor and the other with Notepad++. In both cases "Big5" (Chinese) was detected! – dermoritz Sep 29 '11 at 07:13
EDIT: OK, I should check cm.getConfidence(): with my short "äöüß" the confidence is 10. So I have to decide what confidence is good enough, but that's absolutely OK for this endeavour (charset detection). – dermoritz Sep 29 '11 at 07:21
Direct link to sample code: http://userguide.icu-project.org/conversion/detection – james.garriss Sep 23 '15 at 13:36
The main issue with using ICU4J for character set detection is that the JAR weighs in at 13MB. I've extracted the chardet feature from ICU4J and packaged it into a standalone 75KB library at https://github.com/sigpwned/chardet4j. Same code, smaller footprint. – sigpwned May 02 '22 at 18:54

Benny Code · Answer 4 · 2021-11-28T17:21:46.957

31

Here are my favorites:

TikaEncodingDetector

Dependency:

<dependency>
  <groupId>org.apache.any23</groupId>
  <artifactId>apache-any23-encoding</artifactId>
  <version>1.1</version>
</dependency>

Sample:

public static Charset guessCharset(InputStream is) throws IOException {
  return Charset.forName(new TikaEncodingDetector().guessEncoding(is));    
}

GuessEncoding

Dependency:

<dependency>
  <groupId>org.codehaus.guessencoding</groupId>
  <artifactId>guessencoding</artifactId>
  <version>1.4</version>
  <type>jar</type>
</dependency>

Sample:

  public static Charset guessCharset2(File file) throws IOException {
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
  }

edited Nov 28 '21 at 17:21 · answered Nov 30 '14 at 12:48 by Benny Code

*Note:* **TikaEncodingDetector 1.1** is actually a thin wrapper around the **ICU4J 3.4** `CharsetDetector` class. – Stephan Sep 01 '15 at 17:26
Unfortunately both libs do not work. In one case it identifies a UTF-8 file with German umlauts as ISO-8859-1 and US-ASCII. – Brain Sep 12 '16 at 11:48
@Brain: Is your tested file actually in UTF-8 format and does it include a BOM (https://en.wikipedia.org/wiki/Byte_order_mark)? – Benny Code Sep 12 '16 at 11:53
@BennyNeugebauer the file is UTF-8 without BOM. I checked it with Notepad++, also by changing the encoding and asserting that the umlauts are still visible. – Brain Sep 14 '16 at 07:19

score 15 · Answer 5 · edited Jun 20 '20 at 09:12

Which library to use?

As of this writing, three libraries stand out: GuessEncoding, ICU4j, and juniversalchardet.

I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.

How to tell which one has detected the right charset (or as close as possible)?

It's impossible to certify the charset detected by any of the libraries above. However, it's possible to ask each of them in turn and score the returned responses.

How to score the returned response?

Each response can be assigned one point. The more points a response has, the more confidence you can place in the detected charset. This is a simple scoring method; you can devise more elaborate ones.

Is there any sample code?

Here is a full snippet implementing the strategy described in the previous lines.

public static String guessEncoding(InputStream input) throws IOException {
    // Load input data
    long count = 0;
    int n = 0, EOF = -1;
    byte[] buffer = new byte[4096];
    ByteArrayOutputStream output = new ByteArrayOutputStream();

    while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
        output.write(buffer, 0, n);
        count += n;
    }
    
    if (count > Integer.MAX_VALUE) {
        throw new RuntimeException("Inputstream too large.");
    }

    byte[] data = output.toByteArray();

    // Detect encoding
    Map<String, int[]> encodingsScores = new HashMap<>();

    // * GuessEncoding
    updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());

    // * ICU4j
    CharsetDetector charsetDetector = new CharsetDetector();
    charsetDetector.setText(data);
    charsetDetector.enableInputFilter(true);
    CharsetMatch cm = charsetDetector.detect();
    if (cm != null) {
        updateEncodingsScores(encodingsScores, cm.getName());
    }

    // * juniversalchardet
    UniversalDetector universalDetector = new UniversalDetector(null);
    universalDetector.handleData(data, 0, data.length);
    universalDetector.dataEnd();
    String encodingName = universalDetector.getDetectedCharset();
    if (encodingName != null) {
        updateEncodingsScores(encodingsScores, encodingName);
    }

    // Find winning encoding
    Map.Entry<String, int[]> maxEntry = null;
    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
            maxEntry = e;
        }
    }

    String winningEncoding = maxEntry.getKey();
    //dumpEncodingsScores(encodingsScores);
    return winningEncoding;
}

private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
    String encodingName = encoding.toLowerCase();
    int[] encodingScore = encodingsScores.get(encodingName);

    if (encodingScore == null) {
        encodingsScores.put(encodingName, new int[] { 1 });
    } else {
        encodingScore[0]++;
    }
}    

private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
    System.out.println(toString(encodingsScores));
}

private static String toString(Map<String, int[]> encodingsScores) {
    String GLUE = ", ";
    StringBuilder sb = new StringBuilder();

    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
    }
    int len = sb.length();
    sb.delete(len - GLUE.length(), len);

    return "{ " + sb.toString() + " }";
}

Improvements: The guessEncoding method reads the input stream entirely, which can be a concern for large input streams: every library then processes the whole content, so detecting the charset can take a long time.

It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.
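One way to do that, sketched with JDK classes only: read a bounded prefix through mark/reset so the same stream can still be consumed from the start afterwards. The class name SamplePrefix and the idea of passing the limit as a parameter are assumptions for this example; how many bytes a detector needs for a good guess depends on the library and the data.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper for sampling the start of a stream.
final class SamplePrefix {

    /**
     * Returns up to {@code limit} bytes from the start of the stream and then
     * rewinds it, so a later reader still sees the full content.
     */
    static byte[] peek(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit);                       // remember the current position
        byte[] buffer = new byte[limit];
        int total = 0;
        while (total < limit) {               // read() may return fewer bytes than asked
            int n = in.read(buffer, total, limit - total);
            if (n < 0) {
                break;                        // end of stream before the limit
            }
            total += n;
        }
        in.reset();                           // rewind to the marked position
        return Arrays.copyOf(buffer, total);  // trim to what was actually read
    }
}
```

The returned array can then be passed to the detectors in place of the full `data` array, for example `SamplePrefix.peek(new BufferedInputStream(input), 4096)`.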

Zach Scrivena · Answer 6 · 2009-02-01T07:44:18.177

You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.
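A minimal sketch of that validation step, using only java.nio.charset. The class and method names (StrictDecodeCheck, decodesCleanly) are invented for this example. A `false` result rules a charset out; a `true` result only means the bytes are not invalid in that charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: strict decoding as a way to reject a candidate charset.
final class StrictDecodeCheck {

    /** Returns false if the bytes contain malformed or unmappable input for the charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    // REPORT makes the decoder throw instead of silently
                    // substituting a replacement character
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For instance, the single byte 0xE9 ("é" in ISO-8859-1) is rejected as UTF-8, while the UTF-8 bytes for "é" are accepted by both UTF-8 and ISO-8859-1, since every byte value is valid in ISO-8859-1.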

faghani · Answer 7 · 2016-06-08T20:29:08.157

As far as I know, there is no general-purpose library that suits every kind of problem in this area. So, for each problem, you should test the existing libraries and select the one that best satisfies your constraints, but often none of them is appropriate. In those cases you can write your own encoding detector, as I have done.

I’ve written a meta Java tool for detecting the charset encoding of HTML Web pages, using IBM ICU4j and Mozilla JCharDet as built-in components. Here you can find my tool, please read the README section before anything else. Also, you can find some basic concepts of this problem in my paper and in its references.

Below are some observations from my work:

Charset detection is not a foolproof process, because it is essentially based on statistical data; what actually happens is guessing, not detecting
icu4j is the main tool in this context by IBM, imho
Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and their accuracy did not differ meaningfully from that of icu4j in my tests (at most 1%, as I remember)
icu4j is much more general than jchardet; icu4j is just a bit biased towards IBM family encodings, while jchardet is strongly biased towards UTF-8
Due to the widespread use of UTF-8 in the HTML world, jchardet is overall a better choice than icu4j, but it is not the best choice!
icu4j is great for East Asian specific encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings
Both icu4j and jchardet fail badly on HTML pages with Windows-1251 and Windows-1256 encodings. Windows-1251, aka cp1251, is widely used for Cyrillic-based languages like Russian, and Windows-1256, aka cp1256, is widely used for Arabic
Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and the contents of the input
Some encodings are essentially the same with only partial differences, so in some cases the guessed or detected encoding may be false but at the same time be true, as with Windows-1252 and ISO-8859-1 (refer to the last paragraph of section 5.2 of my paper)

The question is cluttered with really bad and duplicative answers. Thank you for the best answer by far. — Douglas Held, Jan 25 '23 at 10:58
@DouglasHeld Glad it helped. This thread is a good example of [matthew effect](https://en.wikipedia.org/wiki/Matthew_effect) in stackoverflow! — faghani, Apr 03 '23 at 12:22

score 6 · Answer 8 · answered Feb 15 '10 at 11:53

The libs above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scan the text

answered Feb 15 '10 at 11:53 by Lorrat

Just a tip, but there is no "above" on this site - consider stating the libraries you are referring to. – McDowell Jan 19 '11 at 14:02

score 5 · Answer 9 · edited Jan 21 '20 at 13:27

If you use ICU4J (http://icu-project.org/apiref/icu4j/)

Here is my code:

String charset = "ISO-8859-1"; // Default charset, put whatever you want

// Read the whole file into a byte array. Files.readAllBytes (java.nio.file.Files)
// closes the file and, unlike a single FileInputStream.read(byte[]) call,
// always returns the complete content.
byte[] data = Files.readAllBytes(file.toPath());

CharsetDetector detector = new CharsetDetector();
detector.setText(data);

CharsetMatch cm = detector.detect();

if (cm != null) {
    int confidence = cm.getConfidence();
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    //Here you have the encode name and the confidence
    //In my case if the confidence is > 50 I return the encode, else I return the default value
    if (confidence > 50) {
        charset = cm.getName();
    }
}

Remember to add the necessary try-catch blocks.

I hope this works for you.

IMO, this answer is perfectible. If you want to use ICU4j, try this one instead: http://stackoverflow.com/a/4013565/363573. — Stephan, Sep 01 '15 at 15:26

score 4 · Answer 10 · answered Jan 07 '10 at 09:04

I found a nice third party library which can detect actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

I didn't test it extensively but it seems to work.

answered Jan 07 '10 at 09:04 by falcon

The link to the "GuessEncoding" project website is: https://xircles.codehaus.org/p/guessencoding – Benny Code Oct 26 '14 at 23:16

score 4 · Answer 11 · edited May 23 '17 at 12:03

If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.

edited May 23 '17 at 12:03 by Community · answered Jan 31 '09 at 15:46 by Fabian Steeg

score 1 · Answer 12 · edited Sep 03 '15 at 09:47

An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.

Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();

edited Sep 03 '15 at 09:47 by Stephan · answered May 11 '15 at 13:04 by Nolf

Tika AutoDetectReader uses an EncodingDetector loaded with ServiceLoader. Which EncodingDetector implementations do you use? – Stephan Sep 03 '15 at 09:48

score 1 · Answer 13 · edited Dec 10 '11 at 01:03

For ISO8859_1 files, there is not an easy way to distinguish them from ASCII. For Unicode files however one can generally detect this based on the first few bytes of the file.

UTF-8 and UTF-16 files may start with a Byte Order Mark (BOM) at the very beginning of the file. The BOM is the character U+FEFF, historically the zero-width non-breaking space.

Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. Using unix or Cygwin, you can check the BOM with the file command. For example:

$ file sample2.sql 
sample2.sql: Unicode text, UTF-16, big-endian

For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding

Not all UTF-8 or UTF-16 files have a BOM, as it is not required, and UTF-8 BOM is discouraged. — Christoffer Hammarström, Oct 11 '11 at 13:33
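A BOM check of the kind described in this answer can be sketched with the JDK alone. The class name BomSniffer is made up for this example, and it only covers the UTF-8 and UTF-16 marks; as the comment above notes, many files carry no BOM at all, so an empty result says nothing about the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper: looks only at the first bytes of a file.
final class BomSniffer {

    /** Returns the charset indicated by a leading BOM, or empty if there is none. */
    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        // Caveat: FF FE 00 00 is the UTF-32LE BOM; this sketch ignores UTF-32.
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {   // mask: Java bytes are signed
                return false;
            }
        }
        return true;
    }
}
```

When a BOM is found, remember to skip it before decoding, since Java's UTF-8 decoder passes U+FEFF through as a regular character.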

score 0 · Answer 14 · answered Mar 11 '21 at 00:01

A good strategy is to auto-detect the input charset.

I use org.xml.sax.InputSource in Java 11 to solve it:

...    
import org.xml.sax.InputSource;
...

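InputSource inputSource = new InputSource(inputStream);
// Caution: getEncoding() only returns a value set earlier via setEncoding();
// on a fresh InputSource it is null, which InputStreamReader rejects.
// The "UTF-8" fallback is an assumption; an XML parser handed the InputSource
// reads the declared encoding from the document itself.
String encoding = inputSource.getEncoding() != null ? inputSource.getEncoding() : "UTF-8";
inputStreamReader = new InputStreamReader(inputSource.getByteStream(), encoding);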

Input sample:

<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...

Andres · Answer 15 · 2018-07-30T14:12:56.577

-1

In plain Java:

final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };

List<String> lines;

for (String encoding : encodings) {
    try {
        lines = Files.readAllLines(path, Charset.forName(encoding));
        for (String line : lines) {
            // do something...
        }
        break;
    } catch (IOException ioe) {
        System.out.println(encoding + " failed, trying next.");
    }
}

This approach tries the encodings one by one until one works or we run out of them. (BTW, my encodings list has only those items because they are the charset implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)

edited Jul 30 '18 at 14:12 · answered Jul 28 '18 at 16:59 by Andres

But ISO-8859-1 (among many others that you haven't listed) will always succeed. And, of course, this is just guessing, which can't recover the lost metadata that is essential to text file communication. – Tom Blodget Jul 29 '18 at 12:33
Hi @TomBlodget, are you suggesting that the encodings order should be different? – Andres Jul 30 '18 at 14:19
I'm saying that many will "work" but only one is "right". And you don't need to test for ISO-8859-1 because it will always "work". – Tom Blodget Jul 30 '18 at 14:22

score -12 · Answer 16 · answered Jan 31 '09 at 15:44

Can you pick the appropriate char set in the Constructor:

new InputStreamReader(new FileInputStream(in), "ISO8859_1");

answered Jan 31 '09 at 15:44 by Kevin

The point here was to see whether the charset could be determined programmatically. – Joel Jan 31 '09 at 15:46
No, it won't guess it for you. You have to supply it. – Kevin Jan 31 '09 at 15:50
There may be a heuristic method, as suggested by some of the answers here http://stackoverflow.com/questions/457655/java-charset-and-windows/457849#457849 – Joel Jan 31 '09 at 15:56
