3

What should I use to read text files whose encoding (ASCII or Unicode) I don't know?

Is there some class that auto-detects the encoding?

tshepang
angela d
  • What text editor are you using? – Swiss Oct 24 '11 at 08:02
  • using a C++ class, not in a text editor – angela d Oct 24 '11 at 08:05
  • 4
    @angela: This is impossible to do reliably. The encoding tells you how to interpret that data. There is no easy way for a computer to tell whether a certain interpretation is correct (even for humans that can be a very hard task). There are heuristics that can help somewhat, but they are not 100% reliable. – Björn Pollex Oct 24 '11 at 08:07
  • 2
    possible duplicate of [How to determine codepage of a file (that had some codepage transformation applied to it)](http://stackoverflow.com/questions/6957956/how-to-determine-codepage-of-a-file-that-had-some-codepage-transformation-appli) – Jan Hudec Oct 24 '11 at 08:22
  • 1
    [Bush hid the facts is a common name for a bug present in the function IsTextUnicode of Microsoft Windows, which causes a file of text encoded in Windows-1252 or similar encoding to be interpreted as if it were UTF-16LE, resulting in mojibake. When "Bush hid the facts" (without newline) is put in a new Notepad document and saved, closed, and reopened, the words "畂桳栠摩琠敨映捡獴" (Liu Benrenmotian Touyingjianmeng) appear instead.](http://en.wikipedia.org/wiki/Bush_hid_the_facts) – Alexey Frunze Oct 24 '11 at 09:03

3 Answers

6

I can only give a negative answer here: there is no universally correct way to determine the encoding of a file. An ASCII file can be read as ISO-8859-15, because ASCII is a subset of it. Even worse, other files may be valid in two different encodings and mean something different in each. So you need to get this information by some other means. In many cases it is a good approach to just assume that everything is UTF-8. If you are working in a *NIX environment, the LC_CTYPE environment variable may be helpful. If you do not care about the encoding (e.g. you do not change or process the content), you can open files as binary.
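
For the last case, a minimal sketch of reading a file as raw bytes in C++ might look like the following. The helper name `read_bytes` is hypothetical; the point is only that `std::ios::binary` avoids newline translation and that nothing is decoded.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Read a whole file as raw bytes: no newline translation, no decoding.
// read_bytes is a hypothetical helper name, not a standard function.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The returned `std::string` is just a byte buffer here; it can be copied or written back unchanged without ever knowing the encoding.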

Helmut Grohne
  • In many cases, you can't even tell what language a (sufficiently short) snippet of text is in, even if you know the encoding :) – Karl Knechtel Oct 24 '11 at 08:52
1

This is impossible in the general case. If the file contains exactly the bytes I'm typing here, it is equally valid as ASCII, UTF-8 or any of the ISO 8859 variants. Several heuristics can be used as a guess, however: read the first "page" (512 bytes or so), then, in the following order:

  1. See if the block starts with a BOM in one of the Unicode formats
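  2. Look at the first four bytes. If they contain `'\0'`, you're probably dealing with some form of UTF-16 or UTF-32, according to the following pattern:

     | Byte 0 | Byte 1 | Byte 2 | Byte 3 | Encoding |
     |--------|--------|--------|--------|----------|
     | `'\0'` | other  | `'\0'` | other  | UTF-16BE |
     | other  | `'\0'` | other  | `'\0'` | UTF-16LE |
     | `'\0'` | `'\0'` | `'\0'` | other  | UTF-32BE |
     | other  | `'\0'` | `'\0'` | `'\0'` | UTF-32LE |
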
  3. Look for a byte with the top bit set. If it's the start of a legal UTF-8 character, then the file is probably in UTF-8. Otherwise... in the regions where I've worked, ISO 8859-1 is generally the best guess.
  4. Otherwise, you more or less have to assume ASCII, until you encounter a byte with the top bit set (at which point, you use the previous heuristic).

But as I said, it's not 100% sure.
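
A rough sketch of these four steps in C++, assuming the caller passes in the first block of the file as a `std::string` of raw bytes. The names `Guess`, `guess_encoding`, `has_prefix` and `utf8_sequence_at` are made up for illustration, and the ISO 8859-1 fallback is just the regional guess mentioned in step 3.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Guess { Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE, Latin1, Ascii };

static bool has_prefix(const std::string& s, const char* p, std::size_t n) {
    return s.size() >= n && s.compare(0, n, p, n) == 0;
}

// True if s[i] is a UTF-8 lead byte followed by the right number of
// 10xxxxxx continuation bytes. A sequence cut off by the end of the block
// counts as invalid here.
static bool utf8_sequence_at(const std::string& s, std::size_t i) {
    unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t extra;
    if (lead >= 0xC2 && lead <= 0xDF) extra = 1;
    else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;
    else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;
    else return false;
    if (i + extra >= s.size()) return false;
    for (std::size_t k = 1; k <= extra; ++k) {
        unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) return false;
    }
    return true;
}

Guess guess_encoding(const std::string& b) {
    // Step 1: byte order marks. UTF-32LE must be tested before UTF-16LE,
    // because its BOM starts with the same two bytes.
    if (has_prefix(b, "\xEF\xBB\xBF", 3)) return Guess::Utf8;
    if (has_prefix(b, "\xFF\xFE\0\0", 4)) return Guess::Utf32LE;
    if (has_prefix(b, "\0\0\xFE\xFF", 4)) return Guess::Utf32BE;
    if (has_prefix(b, "\xFF\xFE", 2)) return Guess::Utf16LE;
    if (has_prefix(b, "\xFE\xFF", 2)) return Guess::Utf16BE;

    // Step 2: pattern of zero bytes in the first four bytes.
    if (b.size() >= 4) {
        bool z0 = b[0] == 0, z1 = b[1] == 0, z2 = b[2] == 0, z3 = b[3] == 0;
        if (z0 && z1 && z2 && !z3) return Guess::Utf32BE;
        if (!z0 && z1 && z2 && z3) return Guess::Utf32LE;
        if (z0 && !z1 && z2 && !z3) return Guess::Utf16BE;
        if (!z0 && z1 && !z2 && z3) return Guess::Utf16LE;
    }

    // Step 3: the first byte with the top bit set decides UTF-8 vs. fallback.
    for (std::size_t i = 0; i < b.size(); ++i) {
        if (static_cast<unsigned char>(b[i]) & 0x80) {
            return utf8_sequence_at(b, i) ? Guess::Utf8 : Guess::Latin1;
        }
    }

    // Step 4: no high bytes seen in this block, so assume ASCII for now.
    return Guess::Ascii;
}
```

Like the heuristics themselves, this only produces a guess: a short Windows-1252 text can still match the UTF-16 pattern, and an `Ascii` result only covers the bytes inspected so far.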


James Kanze
0

One (brute-force) way of doing this:

  • Build a list of suitable encodings (only ISO code pages and Unicode)
  • Iterate over all considered encodings
  • Encode the text using this encoding
  • Decode it back to Unicode
  • Compare the results for errors
  • If there are no errors, remember the encoding that produced the fewest bytes

Reference: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
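
The round-trip idea can be sketched in plain C++ for a few encodings. This is not the code from the referenced article: it is a byte-oriented variant (decode the bytes under each candidate, re-encode, and keep the candidates that reproduce the input exactly), restricted to ASCII, UTF-8 and ISO 8859-1 so that no conversion library is needed. All function names are hypothetical, and the "fewest bytes" tie-break is left to the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using CodePoints = std::vector<std::uint32_t>;

// Single-byte decoder: accepts bytes up to max_byte (0x7F for ASCII,
// 0xFF for ISO 8859-1, whose bytes map 1:1 onto U+0000..U+00FF).
bool decode_single_byte(const std::string& in, unsigned char max_byte,
                        CodePoints& out) {
    for (unsigned char c : in) {
        if (c > max_byte) return false;
        out.push_back(c);
    }
    return true;
}

std::string encode_single_byte(const CodePoints& cps) {
    std::string out;
    for (std::uint32_t cp : cps) out += static_cast<char>(cp);
    return out;
}

// Structural UTF-8 decoder; rejects bad lead/continuation bytes and
// truncated sequences. Overlong forms are caught later by the round trip.
bool decode_utf8(const std::string& in, CodePoints& out) {
    for (std::size_t i = 0; i < in.size();) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;
        if (c < 0x80) { cp = c; extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else return false;
        if (in.size() - i <= extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char t = static_cast<unsigned char>(in[i + k]);
            if ((t & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (t & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}

std::string encode_utf8(const CodePoints& cps) {
    std::string out;
    for (std::uint32_t cp : cps) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Names of all candidates under which the bytes survive decode + re-encode.
std::vector<std::string> round_trip_candidates(const std::string& bytes) {
    std::vector<std::string> ok;
    CodePoints cps;
    if (decode_single_byte(bytes, 0x7F, cps) && encode_single_byte(cps) == bytes)
        ok.push_back("ASCII");
    cps.clear();
    if (decode_utf8(bytes, cps) && encode_utf8(cps) == bytes)
        ok.push_back("UTF-8");
    cps.clear();
    if (decode_single_byte(bytes, 0xFF, cps) && encode_single_byte(cps) == bytes)
        ok.push_back("ISO-8859-1");
    return ok;
}
```

Note that ISO 8859-1 survives the round trip for every input, which is exactly why brute force alone cannot settle the question: it narrows the candidates but often leaves more than one.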

If you are sure that your incoming encoding is ANSI or Unicode, then you can also check for a byte order mark. But let me tell you that it is not foolproof.

krammer
  • While this may theoretically answer the question, [it would be preferable](http://meta.stackexchange.com/q/8259) to include the essential parts of the answer here, and provide the links for reference. – Bill the Lizard Oct 24 '11 at 15:03