I need to find out which of the files in a directory are UTF-8 encoded and which are ANSI encoded, so I can change the encoding to something else I decide later. My problem is: how can I find out whether a file is UTF-8 or ANSI encoded? Both encodings are actually possible in my files.
5 Answers
There is no fully reliable way to do it (the file might be just random binary), but the process used by Windows Notepad is detailed in Michael S. Kaplan's blog:
http://www.siao2.com/2007/04/22/2239345.aspx
- Check the first two bytes:
  1. If there is a UTF-16 LE BOM, then treat it (and load it) as a "Unicode" file;
  2. If there is a UTF-16 BE BOM, then treat it (and load it) as a "Unicode (Big Endian)" file;
  3. If the first two bytes look like the start of a UTF-8 BOM, then check the next byte, and if we have a full UTF-8 BOM, then treat it (and load it) as a "UTF-8" file;
- Check with IsTextUnicode to see if that function thinks it is BOM-less UTF-16 LE; if so, then treat it (and load it) as a "Unicode" file;
- Check to see if it is UTF-8 using the original RFC 2279 definition from 1998, and if it is, then treat it (and load it) as a "UTF-8" file;
- Assume an ANSI file using the default system code page of the machine.
Now note that there are some holes here, like the fact that step 2 does not do quite as well with BOM-less UTF-16 BE (there may even be a bug here, I'm not sure -- if so it's a bug in Notepad beyond any bug in IsTextUnicode).
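For reference, here is a minimal C# sketch of that decision order (my own approximation, not the code Notepad actually uses). The IsTextUnicode heuristic for BOM-less UTF-16 LE is skipped, and the RFC-style BOM-less UTF-8 check is approximated by attempting a strict UTF-8 decode:

using System;
using System.IO;
using System.Text;

static class NotepadStyleGuess
{
    public static Encoding Guess(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // 1. BOM checks on the first two or three bytes.
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;             // UTF-16 LE ("Unicode")
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;    // UTF-16 BE ("Unicode (Big Endian)")
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;                // UTF-8 with BOM

        // 2. (Omitted) IsTextUnicode-style check for BOM-less UTF-16 LE.

        // 3. BOM-less UTF-8: accept only if the bytes decode without errors.
        try
        {
            new UTF8Encoding(false, true).GetString(bytes); // throws on invalid sequences
            return new UTF8Encoding(false);
        }
        catch (DecoderFallbackException)
        {
            // 4. Otherwise assume ANSI in the system default code page.
            return Encoding.Default;
        }
    }
}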
- StreamReader does this automatically if you pass `true` for the `detectEncodingFromByteOrderMarks` parameter. http://msdn.microsoft.com/en-us/library/7bc2hwcb.aspx – dtb Aug 04 '10 at 09:35
- Thanks, I did not know .NET had internal support for this procedure! – sukru Aug 04 '10 at 09:36
- In my tests, the `detectEncodingFromByteOrderMarks` flag does not detect ANSI encoding. – Bertvan Mar 21 '13 at 11:20
- Definitely the StreamReader can't detect ANSI. – ICantSeeSharp Jan 27 '14 at 14:45
- @ICantSeeSharp How to detect ANSI? – Kiquenet Dec 04 '17 at 08:51
http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2
There is no great way to detect an arbitrary ANSI code page, though there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats like XML or HTML have a way of specifying the character set on the first line in the file, so Web browsers, databases, and classes like XmlTextReader can read these files correctly. But many text files don't have this type of information built in.
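As an illustration of that "declared encoding" case (my own sketch, not from the MSDN page): for an XML file you can peek at the encoding pseudo-attribute of the declaration on the first line before deciding how to read the rest of the file. The regex-based helper below is a rough heuristic and assumes the declaration, if any, fits on the first line:

using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class DeclaredEncoding
{
    // Rough heuristic: look for encoding="..." in the XML declaration on the
    // first line and map it to an Encoding; returns null if nothing is declared.
    public static Encoding FromXmlDeclaration(string path)
    {
        string firstLine = File.ReadLines(path).FirstOrDefault() ?? "";
        Match m = Regex.Match(firstLine, @"encoding\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase);
        return m.Success ? Encoding.GetEncoding(m.Groups[1].Value) : null;
    }
}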

Unicode, UTF-8 and Unicode big-endian are considered to be different types here. ANSI is effectively treated the same as UTF-8, since plain ASCII bytes are also valid UTF-8.
using System;
using System.IO;
using System.Text;

public class EncodingType
{
    public static System.Text.Encoding GetType(string FILE_NAME)
    {
        FileStream fs = new FileStream(FILE_NAME, FileMode.Open, FileAccess.Read);
        Encoding r = GetType(fs);
        fs.Close();
        return r;
    }

    public static System.Text.Encoding GetType(FileStream fs)
    {
        // BOM signatures followed by one data byte, as used by the comparisons below.
        byte[] Unicode = new byte[] { 0xFF, 0xFE, 0x41 };
        byte[] UnicodeBIG = new byte[] { 0xFE, 0xFF, 0x00 };
        byte[] UTF8 = new byte[] { 0xEF, 0xBB, 0xBF }; // with BOM

        Encoding reVal = Encoding.Default; // fallback: ANSI (system code page)

        BinaryReader r = new BinaryReader(fs, System.Text.Encoding.Default);
        int i = (int)fs.Length;      // assumes the file fits in memory
        byte[] ss = r.ReadBytes(i);  // read the whole file

        if (IsUTF8Bytes(ss) || (ss.Length >= 3 && ss[0] == 0xEF && ss[1] == 0xBB && ss[2] == 0xBF))
        {
            reVal = Encoding.UTF8;
        }
        else if (ss.Length >= 3 && ss[0] == 0xFE && ss[1] == 0xFF && ss[2] == 0x00)
        {
            reVal = Encoding.BigEndianUnicode;
        }
        else if (ss.Length >= 3 && ss[0] == 0xFF && ss[1] == 0xFE && ss[2] == 0x41)
        {
            reVal = Encoding.Unicode;
        }
        r.Close();
        return reVal;
    }

    // Returns true if the data forms a valid UTF-8 byte sequence (without relying on a BOM).
    private static bool IsUTF8Bytes(byte[] data)
    {
        int charByteCounter = 1; // bytes remaining in the current character
        byte curByte;
        for (int i = 0; i < data.Length; i++)
        {
            curByte = data[i];
            if (charByteCounter == 1)
            {
                if (curByte >= 0x80)
                {
                    // Count the leading 1-bits to get the length of the multi-byte sequence.
                    while (((curByte <<= 1) & 0x80) != 0)
                    {
                        charByteCounter++;
                    }
                    // A lead byte must introduce a 2- to 6-byte sequence.
                    if (charByteCounter == 1 || charByteCounter > 6)
                    {
                        return false;
                    }
                }
            }
            else
            {
                // Continuation bytes must match 10xxxxxx.
                if ((curByte & 0xC0) != 0x80)
                {
                    return false;
                }
                charByteCounter--;
            }
        }
        if (charByteCounter > 1)
        {
            throw new Exception("Error byte format"); // truncated multi-byte sequence
        }
        return true;
    }
}
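A quick usage sketch for the class above (the path is just a placeholder):

// Hypothetical usage; any text file path will do.
System.Text.Encoding enc = EncodingType.GetType(@"C:\temp\sample.txt");
Console.WriteLine(enc.EncodingName);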

- That looks like a great bit of code. However, aren't UTF-16 LE and UTF-16 BE supposed to have the signatures "FF FE" and "FE FF" respectively? You've added an extra byte. See: http://www.unicode.org/faq/utf_bom.html#bom4 – Dan W Oct 12 '12 at 00:14
- By the way, how does your IsUTF8Bytes() function compare to Christoph's answer shown here: http://stackoverflow.com/a/1031773/848344 – Dan W Oct 12 '12 at 01:49
- @DanW: I don't know where Christoph's answer is from; the code I posted is part of a project I worked on, written by a teammate. – Cheng Chen Oct 12 '12 at 02:18
- I suggest using the [`using` statement](http://msdn.microsoft.com/en-US/library/yh598w02.aspx) instead of manually closing/disposing the objects, just to ensure they get disposed even in case of an unexpected exception. – mbx Jun 07 '13 at 13:09
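As the first comment points out, the standard signatures are just "FF FE" (UTF-16 LE), "FE FF" (UTF-16 BE) and "EF BB BF" (UTF-8). A BOM-only check along those lines, wrapped in a `using` block as the last comment suggests, might look like this (my sketch, not part of the original answer):

using System.IO;
using System.Text;

static class BomSniffer
{
    // Returns the encoding indicated by a BOM, or null if no BOM is present.
    public static Encoding FromBom(string path)
    {
        byte[] b = new byte[3];
        int read;
        using (FileStream fs = File.OpenRead(path))
        {
            read = fs.Read(b, 0, 3);
        }
        if (read >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return Encoding.UTF8;
        if (read >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Encoding.Unicode;          // UTF-16 LE
        if (read >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Encoding.BigEndianUnicode; // UTF-16 BE
        return null; // no BOM: could be BOM-less UTF-8, ANSI, or anything else
    }
}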
See these two CodeProject articles; it is not trivial to determine a file's encoding from the file content alone:

public static System.Text.Encoding GetEncoding(string filepath, Encoding defaultEncoding)
{
    // Falls back to defaultEncoding if the file does not have a BOM.
    using (var reader = new StreamReader(filepath, defaultEncoding, true))
    {
        reader.Peek(); // forces the reader to examine the first bytes so CurrentEncoding is updated
        return reader.CurrentEncoding;
    }
}
Check the Byte Order Mark (BOM).
To see the BOM you need to view the file in a hexadecimal viewer.
Notepad shows the file encoding in the status bar, but that can only be an estimate if the file has no BOM.

- I tried this, and everything (UTF-16 BE, LE, ANSI and UTF-8) all came back as UTF-8. – PHenry Jun 16 '21 at 20:18
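If you want to inspect those first bytes without a hex editor, a few lines of C# are enough to dump them (my sketch; prints e.g. "EF BB BF ..." for UTF-8 with a BOM):

using System;
using System.IO;
using System.Linq;

class HexPeek
{
    static void Main(string[] args)
    {
        byte[] head = new byte[4];
        int read;
        using (FileStream fs = File.OpenRead(args[0]))
        {
            read = fs.Read(head, 0, head.Length);
        }
        // Show the leading bytes in hexadecimal so any BOM is visible.
        Console.WriteLine(string.Join(" ", head.Take(read).Select(b => b.ToString("X2"))));
    }
}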