
I am trying to create a method that can detect the encoding scheme of a text file. I know there are many encodings out there, but I know for sure my text file will be either ASCII, UTF-8, or UTF-16. I only need to detect these three. Does anyone know a way to do this?

Icemanind
  • Do you know if they have a BOM (byte order mark)? If so, you can use that to determine the type. – alexn May 09 '12 at 19:11
  • You can safely ignore ASCII. Any valid ASCII file is always a valid UTF-8 file (assuming you’re using the correct 7-bit definition of ASCII). – Douglas May 09 '12 at 19:17
  • You are SOL if there is no BOM. – Mike Corcoran May 09 '12 at 19:19
  • @MikeCorcoran: Hardly. If you’re dealing with predominantly English text, then there are heuristics which give highly accurate results. For example, you can identify a UTF-16 file because most alternate bytes would be `\0`. – Douglas May 09 '12 at 19:24
  • Unfortunately, I don't think there is a BOM. I just looked in a hex editor. – Icemanind May 09 '12 at 20:05
  • Looking for `\0` bytes works even better for UTF-32, because the restriction of code points to below U+10FFFF *guarantees* that every fourth byte is zero. Not that the OP asked about it, but useful to know. – dan04 May 09 '12 at 20:08
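The zero-byte heuristic from the comments above could be sketched roughly as follows. This is only an illustration for predominantly English (Latin-range) text without a BOM: the names `EncodingHeuristics`, `Utf16Guess`, `IsAscii`, and `GuessUtf16` are hypothetical, and the 0.4 threshold is an arbitrary assumption, not a tested value.

```csharp
using System;

// Hypothetical result type for the heuristic below.
public enum Utf16Guess { None, LittleEndian, BigEndian }

public static class EncodingHeuristics
{
    // True when every byte is in the 7-bit ASCII range (0x00-0x7F).
    // Such data is also valid UTF-8.
    public static bool IsAscii(byte[] bytes)
    {
        foreach (byte b in bytes)
        {
            if (b > 0x7F) return false;
        }
        return true;
    }

    // Guesses UTF-16 byte order from where the zero bytes fall.
    // The threshold is an assumed, illustrative value.
    public static Utf16Guess GuessUtf16(byte[] bytes, double threshold = 0.4)
    {
        // UTF-16 data always has an even number of bytes.
        if (bytes.Length < 2 || bytes.Length % 2 != 0) return Utf16Guess.None;

        int evenZeros = 0, oddZeros = 0;
        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] != 0) continue;
            if (i % 2 == 0) evenZeros++; else oddZeros++;
        }

        double pairs = bytes.Length / 2.0;

        // Little-endian: the (zero) high byte comes second, at odd offsets.
        if (oddZeros / pairs >= threshold && evenZeros / pairs < threshold / 4)
            return Utf16Guess.LittleEndian;

        // Big-endian: the (zero) high byte comes first, at even offsets.
        if (evenZeros / pairs >= threshold && oddZeros / pairs < threshold / 4)
            return Utf16Guess.BigEndian;

        return Utf16Guess.None;
    }
}
```

Text that is mostly outside the Latin range (for example CJK) has few zero bytes in UTF-16, so this check would return `None` for it and another test would be needed.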

2 Answers


First, open the file in binary mode and read it into memory.

For UTF-8 (or ASCII), do a validation check. Decode the bytes with a strict decoder, Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes), and catch the exception it throws on invalid input. If you don't get one, the data is valid UTF-8. Pure ASCII always passes this check, because ASCII is a subset of UTF-8. Here is the code:

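// Requires: using System.IO; using System.Text;
// Returns true if the whole file decodes as valid UTF-8 (ASCII included).
private bool detectUTF8Encoding(string filename)
{
    byte[] bytes = File.ReadAllBytes(filename);
    try {
        // ExceptionFallback makes the decoder throw instead of substituting U+FFFD.
        Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes);
        return true;
    } catch (DecoderFallbackException) {
        // Invalid byte sequence: not UTF-8.
        return false;
    }
}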

For UTF-16, check for the BOM at the start of the file: FE FF for big-endian, or FF FE for little-endian.
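A BOM check along these lines might look like the sketch below. The names `BomSniffer` and `DetectBom` are hypothetical, and it only covers the three encodings the question asks about. Note that `FF FE 00 00` is also the UTF-32 little-endian BOM, which this sketch would report as UTF-16LE.

```csharp
using System;

public static class BomSniffer
{
    // Returns a label for the BOM found at the start of the data,
    // or "none" when no known BOM is present.
    public static string DetectBom(byte[] bytes)
    {
        // UTF-8 BOM: EF BB BF
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return "UTF-8";

        // UTF-16 big-endian BOM: FE FF
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return "UTF-16BE";

        // UTF-16 little-endian BOM: FF FE
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return "UTF-16LE";

        return "none";
    }
}
```

A result of "none" says nothing about the encoding itself; it only means the file has to be classified some other way, such as the validation check above.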

Dan W
dan04
  • For UTF-8, you can also check for the BOM: `EF BB BF`. If present, this check would succeed much more quickly than decoding the text. – Douglas May 09 '12 at 19:35
  • **If** present. It's not necessary for UTF-8, and often omitted, especially on Unix-like systems. – dan04 May 09 '12 at 20:01
  • Yes, that’s true. But since it’s a quick check to perform, it’s worth throwing in for the few times it succeeds. – Douglas May 09 '12 at 20:35

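Use a StreamReader to identify the encoding. It detects the encoding from the byte order mark if one is present; otherwise it keeps the encoding you pass in.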

Example:

using(var r = new StreamReader(filename, Encoding.Default))
{
    richtextBox1.Text = r.ReadToEnd();
    var encoding = r.CurrentEncoding;
}
animaonline
  • You have to already know the encoding in order to use StreamReader. – dan04 May 09 '12 at 19:12
  • This answer is correct. [“A StreamReader will try to automatically detect the encoding of a file if there's a BOM when trying to read.”](http://stackoverflow.com/a/3746545/1149773) – Douglas May 09 '12 at 19:16
  • This method will fall back to the user's local encoding if it's not UTF8 which could be desirable. However, it won't be able to detect UTF8 if there's no BOM, even if it's perfectly valid UTF8 text. – Dan W Oct 11 '12 at 18:31
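One possible way around the limitation in the last comment is to give StreamReader a strict UTF-8 encoding as its fallback instead of Encoding.Default. The sketch below assumes that approach; `ReaderDetection` and `DetectWithReader` are hypothetical names. With a BOM, StreamReader switches to the detected encoding. Without one, it decodes as UTF-8 and throws on invalid bytes.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderDetection
{
    // Returns the encoding StreamReader settled on, or null when the data
    // has no BOM and is not valid UTF-8.
    public static Encoding DetectWithReader(Stream stream)
    {
        // Strict UTF-8: throws DecoderFallbackException on invalid bytes.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            using (var reader = new StreamReader(
                stream, strictUtf8, detectEncodingFromByteOrderMarks: true))
            {
                // CurrentEncoding is only reliable after the first read.
                reader.ReadToEnd();
                return reader.CurrentEncoding;
            }
        }
        catch (DecoderFallbackException)
        {
            // No BOM and the bytes are not valid UTF-8.
            return null;
        }
    }
}
```

A null result here would still leave BOM-less UTF-16 undetected, so a zero-byte heuristic like the one discussed under the question would be the next thing to try.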