Detecting encoding is always a tricky business, but detecting BOMs is dead simple. To get the BOM as a byte array, just call the GetPreamble()
method of the encoding object. This lets you detect a whole range of encodings by their preamble.
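To illustrate the "whole range of encodings" part, here is a minimal sketch of a preamble lookup over the Unicode encodings built into .NET. The class name `BomSniffer` and the method `DetectByBom` are made up for this example. The ordering matters: the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE), so the longer preambles are tried first.

```csharp
using System;
using System.Linq;
using System.Text;

static class BomSniffer
{
    // Candidates ordered longest preamble first, so that UTF-32 LE
    // (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(false, true, true),   // UTF-32 LE, emits BOM, strict
        new UTF32Encoding(true, true, true),    // UTF-32 BE, emits BOM, strict
        new UTF8Encoding(true, true),           // UTF-8, emits BOM, strict
        new UnicodeEncoding(false, true, true), // UTF-16 LE, emits BOM, strict
        new UnicodeEncoding(true, true, true),  // UTF-16 BE, emits BOM, strict
    };

    // Returns the first candidate whose preamble starts the buffer,
    // or null when no known BOM is present.
    public static Encoding DetectByBom(byte[] bytes)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (preamble.Length > 0
                && bytes.Length >= preamble.Length
                && preamble.SequenceEqual(bytes.Take(preamble.Length)))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

A null result only means "no BOM", not "not Unicode". Also, a UTF-16 LE file whose first character is U+0000 would be reported as UTF-32 LE by this sketch; that case is rare enough that I'd accept it.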
Now, as for detecting UTF-8 without a preamble, that's actually not very hard either. UTF-8 has strict bitwise rules about which byte values are allowed in a valid sequence, and you can construct a UTF8Encoding object (with throwOnInvalidBytes set to true) so that decoding throws an exception when a sequence is invalid.
So if you first do the BOM check, then the strict decoding check, and finally fall back to Windows-1252 (what you call "ANSI"), your detection is done.
// Needs: using System; using System.IO; using System.Linq; using System.Text;
Byte[] bytes = File.ReadAllBytes(filename);
Encoding encoding = null;
String text = null;
// Test UTF-8 with BOM. This check can easily be copied and adapted
// to detect many other encodings that use BOMs.
UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
Boolean couldBeUtf8 = true;
Byte[] preamble = encUtf8Bom.GetPreamble();
Int32 prLen = preamble.Length;
if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
{
    // UTF-8 BOM found; use encUtf8Bom to decode.
    try
    {
        // GetString does not strip the preamble, even on an encoding
        // created with one, so skip it explicitly.
        text = encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
        encoding = encUtf8Bom;
    }
    catch (ArgumentException)
    {
        // Strict decoding failed (DecoderFallbackException derives from
        // ArgumentException): confirmed as not UTF-8!
        couldBeUtf8 = false;
    }
}
// Use the boolean to skip this if the data was already confirmed as invalid UTF-8.
if (couldBeUtf8 && encoding == null)
{
    // Test UTF-8 on strict encoding rules. Note that on pure ASCII this will
    // succeed as well, since valid ASCII is automatically valid UTF-8.
    UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
    try
    {
        text = encUtf8NoBom.GetString(bytes);
        encoding = encUtf8NoBom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
    }
}
// Fall back to the default ANSI encoding.
if (encoding == null)
{
    // On .NET Core / .NET 5+, code page 1252 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    encoding = Encoding.GetEncoding(1252);
    text = encoding.GetString(bytes);
}
Note that Windows-1252 (US / Western European ANSI) is a one-byte-per-character encoding, so decoding with it never fails: every byte produces some technically valid character. Unless you go for heuristic methods, no further detection can be done to distinguish it from other one-byte-per-character encodings.