
From a detailed perspective how does one identify the character set of a file? Some information I found was checking by the magic number of the file, but other articles I found strayed away from this.

I have tried opening different files encoded in different character sets (ASCII/UTF8 for example) with hexdump and there is no file identifier on what character set the file is.

  • Visit: http://stackoverflow.com/questions/4520184/how-to-detect-the-character-encoding-of-a-text-file – Hernaldo Gonzalez Sep 24 '13 at 15:22
  • Guessing at text encodings doesn't work very well. You should try to avoid having to do it; make the source of the data tell you the encoding. – bames53 Sep 27 '13 at 15:07

3 Answers


It is practically impossible to identify an arbitrary character set just by looking at a raw byte dump. Some character sets exhibit typical patterns by which they can be recognized, but even that doesn't give a definite match. The best you can usually do is guess by exclusion, starting with the character sets that have strict rules: if a file is not valid UTF-8, try Shift-JIS, then Big5, and so on.

The fundamental problem is that any byte sequence is valid Latin-1 (and valid in most other single-byte encodings), which also makes it virtually impossible to distinguish one single-byte charset from another. In the end you'd have to resort to text analysis: does the decoded text appear to make sense, or does it look like gibberish, in which case the guessed encoding was likely wrong?

In short: there's no foolproof way to detect character sets, period. You should always have metadata which specifies the charset.
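The guess-by-exclusion idea above can be sketched in a few lines of Python. This is purely illustrative, not a reliable detector: the candidate list is an assumption, and the Latin-1 fallback succeeds for *any* input, which is exactly the problem the answer describes.

```python
# Guess-by-exclusion sketch: try strict decoders in order and fall back to
# Latin-1, which accepts every possible byte sequence.
CANDIDATES = ["utf-8", "shift_jis", "big5"]  # illustrative ordering

def guess_encoding(data: bytes) -> str:
    for encoding in CANDIDATES:
        try:
            data.decode(encoding)
            return encoding  # decoded cleanly; a guess, not proof
        except UnicodeDecodeError:
            continue
    return "latin-1"  # every byte sequence is valid Latin-1

print(guess_encoding("héllo".encode("utf-8")))  # utf-8
```

Note that a "successful" decode only means the bytes were structurally valid in that encoding; it says nothing about whether the resulting text makes sense.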

deceze

No.

I wrote a library that checked UTF-8 conformity (its distinctive bit syntax) and then tried to identify the language(s), and with it the likely character encoding, by matching against the 100 most frequent words per language. The single-byte ISO-8859-* encodings can in general be derived from the language of the content.
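The "bit syntax" of UTF-8 is mechanical enough to check by hand: a lead byte's high bits announce the sequence length (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx), and every following byte must look like 10xxxxxx. A minimal sketch (it deliberately ignores overlong forms and surrogate ranges, which a real validator must also reject):

```python
# Minimal UTF-8 conformity check based on the lead/continuation bit patterns.
# Sketch only: does not reject overlong encodings or surrogate code points.
def is_utf8(data: bytes) -> bool:
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:            # 0xxxxxxx: ASCII, one byte
            n = 0
        elif 0xC0 <= b < 0xE0:  # 110xxxxx: two-byte sequence
            n = 1
        elif 0xE0 <= b < 0xF0:  # 1110xxxx: three-byte sequence
            n = 2
        elif 0xF0 <= b < 0xF8:  # 11110xxx: four-byte sequence
            n = 3
        else:                   # stray continuation byte or invalid lead
            return False
        if i + n >= len(data):  # sequence truncated at end of input
            return False
        for j in range(i + 1, i + 1 + n):
            if data[j] & 0xC0 != 0x80:  # continuation must be 10xxxxxx
                return False
        i += n + 1
    return True
```

Because random single-byte text rarely happens to follow these rules, a clean pass is fairly strong evidence of UTF-8, which is why UTF-8 makes a good first candidate in the exclusion order.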

In general there is no magic number. Unicode does define an optional BOM (byte order mark), which is mostly used with UTF-16 to signal endianness (little endian vs. big endian).
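Sniffing for a BOM is the one cheap check that is worth doing first. A sketch using the BOM constants from Python's standard `codecs` module; keep in mind the BOM is optional, so its absence proves nothing, and a Latin-1 file could legitimately begin with the same bytes:

```python
import codecs

# Check 4-byte BOMs before UTF-16, since the UTF-32 LE BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: the file may still be in any encoding
```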

So maybe search for language recognizers.

Joop Eggen

There is no way to do this reliably for all encodings, and there is no universal magic number or identifier for it either. You can use heuristics for some encodings, such as UTF-8, but in most cases you simply have to know the encoding.

Uku Loskit