In the documentation for StreamReader, it says:

StreamReader defaults to UTF-8 encoding unless specified otherwise

Does that mean when I read a file, it will treat this file as UTF-8 encoded? Or does it mean something else, because I have tested reading a UTF-16LE encoded file and it worked without a problem.

StreamReader sr = new StreamReader(new FileStream("D:\\1.txt", FileMode.Open, FileAccess.Read));
string str = sr.ReadToEnd();
Console.WriteLine(str);
sr.Close();
  • This one can help: http://stackoverflow.com/questions/3746530/auto-encoding-detect-in-c-sharp – Orace Feb 18 '15 at 15:32
  • @Orace So `StreamReader` will attempt to detect known encodings, but if it did not recognize any, it will treat the file as UTF-8? –  Feb 18 '15 at 15:37
  • If you don't specify the Encoding explicitly, then StreamReader will try to auto-detect it from the file. If it can't figure it out, it falls back to UTF-8. That usually turns out well for a file that contains UTF-16 encoded text, since such a file usually has a BOM. – Hans Passant Feb 18 '15 at 15:44

1 Answer

Perhaps the simplest way to find out is to run some tests:

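using System;
using System.IO;

internal static class Program
{
    private static void Main()
    {
        // U+0061 'a' followed by U+2554 '╔', in three byte layouts.
        var bytes1 = new byte[] {0x00, 0x61, 0x25, 0x54};             // UTF-16 BE bytes, no BOM
        var bytes2 = new byte[] {0xFE, 0xFF, 0x00, 0x61, 0x25, 0x54}; // UTF-16 BE BOM + text
        var bytes3 = new byte[] {0xFF, 0xFE, 0x61, 0x00, 0x54, 0x25}; // UTF-16 LE BOM + text

        Write(bytes1); // Writes: ' a%T' (read as UTF-8; the first character is NUL)
        Write(bytes2); // Writes: 'a╔'
        Write(bytes3); // Writes: 'a╔'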

        Console.ReadKey();
    }

    private static void Write(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        {
            using (var sr = new StreamReader(ms))
            {
                var str = sr.ReadToEnd();
                Console.WriteLine(str);
            }
        }
    }
}

So if the first two bytes of the stream are the byte order mark (BOM) of UTF-16 (LE or BE), the stream is read as UTF-16. Otherwise it is read as UTF-8.
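To confirm which encoding was actually picked, rather than inferring it from the printed text, you can inspect the CurrentEncoding property after the first read. This is only a minimal sketch: the EncodingProbe class and the DetectCodePage helper are names made up for illustration, not part of the framework.

```csharp
using System;
using System.IO;

internal static class EncodingProbe
{
    // Hypothetical helper: reads the whole stream, then returns the code page
    // of the encoding the StreamReader settled on.
    internal static int DetectCodePage(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var sr = new StreamReader(ms))
        {
            // Detection happens on the first read, so read before asking.
            sr.ReadToEnd();
            return sr.CurrentEncoding.CodePage;
        }
    }

    private static void Main()
    {
        Console.WriteLine(DetectCodePage(new byte[] {0x00, 0x61, 0x25, 0x54}));             // no BOM
        Console.WriteLine(DetectCodePage(new byte[] {0xFE, 0xFF, 0x00, 0x61, 0x25, 0x54})); // UTF-16 BE BOM
        Console.WriteLine(DetectCodePage(new byte[] {0xFF, 0xFE, 0x61, 0x00, 0x54, 0x25})); // UTF-16 LE BOM
    }
}
```

Code pages 65001, 1201 and 1200 correspond to UTF-8, UTF-16 BE and UTF-16 LE respectively.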

[Edit]

Strangely, the documentation for the StreamReader(Stream, Encoding) constructor contains information that the documentation for the StreamReader(Stream) constructor does not have:

The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used.

First remark: the user-provided encoding is not necessarily used, because a recognized byte order mark takes precedence over it.
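A small sketch of that remark, assuming an ASCII fallback purely for illustration (BomOverride and Decode are hypothetical names): the same user-provided encoding is ignored when a UTF-16 BOM is present and used when there is none.

```csharp
using System;
using System.IO;
using System.Text;

internal static class BomOverride
{
    // Hypothetical helper: decodes the bytes with a caller-supplied encoding.
    // The (Stream, Encoding) overload still looks for a byte order mark.
    internal static string Decode(byte[] bytes, Encoding userEncoding)
    {
        using (var ms = new MemoryStream(bytes))
        using (var sr = new StreamReader(ms, userEncoding))
        {
            return sr.ReadToEnd();
        }
    }

    private static void Main()
    {
        var withBom = new byte[] {0xFF, 0xFE, 0x61, 0x00, 0x54, 0x25}; // UTF-16 LE BOM + "a╔"
        var withoutBom = new byte[] {0x61, 0x62, 0x63};                // plain "abc"

        // The BOM wins: the text is decoded as UTF-16 LE, not ASCII.
        Console.WriteLine(Decode(withBom, Encoding.ASCII) == "a\u2554");
        // No BOM: the user-provided ASCII encoding is used.
        Console.WriteLine(Decode(withoutBom, Encoding.ASCII) == "abc");
    }
}
```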

Now if you look at the reference implementation, the constructor with only a Stream as parameter is in fact a call to:

StreamReader(stream: stream, encoding: Encoding.UTF8, detectEncodingFromByteOrderMarks: true, bufferSize: DefaultBufferSize, leaveOpen: false)

So the information above applies.

More precisely, this is the passage that applies:

If the detectEncodingFromByteOrderMarks parameter is true, the constructor detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used.
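For comparison, here is a hedged sketch of what changes when detectEncodingFromByteOrderMarks is false. The DetectFlag class and its Decode helper are illustrative names, and ASCII is an arbitrary choice of user-provided encoding.

```csharp
using System;
using System.IO;
using System.Text;

internal static class DetectFlag
{
    // Hypothetical helper: decodes with ASCII as the user-provided encoding,
    // with BOM detection switched on or off by the caller.
    internal static string Decode(byte[] bytes, bool detectBom)
    {
        using (var ms = new MemoryStream(bytes))
        using (var sr = new StreamReader(ms, Encoding.ASCII, detectBom))
        {
            return sr.ReadToEnd();
        }
    }

    private static void Main()
    {
        var bytes = new byte[] {0xFF, 0xFE, 0x61, 0x00, 0x54, 0x25}; // UTF-16 LE BOM + "a╔"

        // Detection on: the BOM is consumed and two UTF-16 characters come back.
        Console.WriteLine(Decode(bytes, true).Length);
        // Detection off: every byte, BOM included, is decoded as ASCII.
        Console.WriteLine(Decode(bytes, false).Length);
    }
}
```

With detection off, the BOM bytes are treated as ordinary data, so the result is six characters instead of two.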

Orace