4

Possible Duplicate:
How can I detect the encoding/codepage of a text file

I've been developing a WinForms application and need to read txt files.

Unfortunately, the txt files come in many different encodings, so I can't read them all with one specific encoding.

The problem is how to determine the encoding of a txt file.

Justin
  • 1
    @Gens, BOM is for Unicode encoded files that specify the endianness of the file. That's not the same as the encoding which can be anything, including non-Unicode. – Samuel Neff Jun 16 '11 at 02:17

2 Answers

2

See this answer here:

How can I detect the encoding/codepage of a text file

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

and the article it links to:

http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
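To make the "analyse the bytes and guess" point concrete, here is a minimal sketch of one common heuristic: check whether the bytes are valid UTF-8 by decoding them strictly. The class name `EncodingGuess` and the method name `LooksLikeUtf8` are made up for illustration and are not from the linked answer or article. Note that this is still only a guess: pure ASCII text always passes, and a file in a legacy code page can pass by coincidence.

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // Hypothetical helper: returns true when 'bytes' decode as well-formed UTF-8.
    // A "true" result does not prove the file is UTF-8, it only fails to rule it out.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // Second argument 'true' makes the decoder throw on invalid byte sequences
        // instead of silently substituting replacement characters.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

If the check fails, the file is definitely not UTF-8, and you still have to be told (or guess) which code page it really uses.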

Samuel Neff
2

Following the hints from @Gens and @Samuel Neff, I solved the problem. Here is my code.

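using System.IO;
using System.Text;

public static Encoding GetFileEncoding(string srcFile)
{
    // Fall back to Encoding.Default (the ANSI code page on .NET Framework) when no BOM is found.
    Encoding encoding = Encoding.Default;
    using (FileStream stream = File.OpenRead(srcFile))
    {
        // Read the first bytes and remember how many were actually read,
        // so that a short file is not mistaken for a longer BOM.
        byte[] buff = new byte[4];
        int read = stream.Read(buff, 0, buff.Length);
        if (read >= 3 && buff[0] == 0xEF && buff[1] == 0xBB && buff[2] == 0xBF)
        {
            encoding = Encoding.UTF8;
        }
        else if (read >= 4 && buff[0] == 0xFF && buff[1] == 0xFE && buff[2] == 0 && buff[3] == 0)
        {
            // UTF-32 little-endian; must be tested before UTF-16 LE, which shares the FF FE prefix.
            encoding = Encoding.UTF32;
        }
        else if (read >= 4 && buff[0] == 0 && buff[1] == 0 && buff[2] == 0xFE && buff[3] == 0xFF)
        {
            // UTF-32 big-endian (Encoding.UTF32 itself is little-endian).
            encoding = new UTF32Encoding(true, true);
        }
        else if (read >= 2 && buff[0] == 0xFE && buff[1] == 0xFF)
        {
            encoding = Encoding.BigEndianUnicode;
        }
        else if (read >= 2 && buff[0] == 0xFF && buff[1] == 0xFE)
        {
            encoding = Encoding.Unicode;
        }
        else if (read >= 3 && buff[0] == 0x2B && buff[1] == 0x2F && buff[2] == 0x76)
        {
            // UTF-7; note that Encoding.UTF7 is marked obsolete in .NET 5 and later.
            encoding = Encoding.UTF7;
        }
    }
    return encoding;
}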
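For comparison with the hand-written BOM checks, here is a minimal sketch that lets `StreamReader` do the BOM sniffing instead. The class name `BomSniffer` and the method name `DetectWithStreamReader` are made up for illustration, and the fallback encoding is an assumption you would adjust to your own files. Like the code above, it only recognises files that start with a byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Hypothetical helper: returns the encoding StreamReader settles on after
    // looking for a byte order mark, or 'fallback' when the file has no BOM.
    public static Encoding DetectWithStreamReader(string path, Encoding fallback)
    {
        // Third argument 'true' enables detection from byte order marks.
        using (var reader = new StreamReader(path, fallback, true))
        {
            // Detection only happens on the first read, so peek before asking.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

Usage would look like `Encoding enc = BomSniffer.DetectWithStreamReader(path, Encoding.Default);`, after which the file can be reopened or read with `enc`.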
Justin
  • 1
    +1 That's gold. I was going to suggest some "magic" that's similar, but it's for detecting MIME types. It's a library called Winista, and I default to URLMon for files it can't detect; see here: http://social.msdn.microsoft.com/forums/en-US/Vsexpressvcs/thread/d79e76e3-b8c9-4fce-a97d-94ded18ea4dd/ – Jeremy Thompson Jun 16 '11 at 03:43
  • @Jeremy Thompson, thank you. I've read it and learned more. +1 – Justin Jun 16 '11 at 06:30