C# Stream Reader does not differentiate between UTF-16 and UTF-8

Question

I am building an app that downloads a CSV file in plain text from an e-mail server and writes it to the local file system. I am developing this app in C# using .NET Core 3.1.

The problem is that I don't know the encoding of the files that I am receiving, so I decided to use the StreamReader class to convert the bytes that I downloaded from the e-mail to a string.

Here is the code:

foreach (var data in loadedData)
{
    if (IsValidData(data))
    {
        logger.Info($"Writing data from: {data.FileName}");

        using var stream = new MemoryStream(data.FileContent);
        using var reader = new StreamReader(stream, true);

        var csvData = new CSVData
        {
            FileName = data.FileName,
            FileContent = reader.ReadToEnd(),
        };
        dataWriter.WriteData(csvData);
        logger.Info($"Data from: {data.FileName} was successfully written");
    }
    else
    {
        logger.Warn($"Invalid format: {data.FileName}");
    }
}

And to write the data to the actual files I am using:

public void WriteData(CSVData data)
{
    logger.Debug($"Writing received file: {data.FileName}");

    var outputDir = config.GetReceivedFilesPath();
    string fileName = this.GetOutputPath(data.FileName, outputDir);

    Directory.CreateDirectory(outputDir);
    using var writer = new StreamWriter(fileName, false, Encoding.UTF8);
    writer.Write(data.FileContent);
    logger.Debug($"The received data was successfully written to: {data.FileName}");
}

The problem is that some of the files I receive are encoded in UTF-16 (I believe this is the encoding being used, because there is a \0 after each char), but the StreamReader interprets these files as UTF-8: the reader.CurrentEncoding property returns UTF-8.

The end result is that instead of having my files output as UTF-8, my app is outputting them as UTF-16, even though I explicitly specified UTF-8 as the output encoding.

What am I doing wrong?

It can only detect UTF-16 if the BOM is present. What are the first 4 bytes of `data.FileContent`? — madreflection, Oct 07 '20 at 23:22
The documentation states that `StreamReader` _always_ defaults to UTF-8 unless specified otherwise, so you'll have to do the detection yourself and then pass the right `Encoding` to the reader's constructor. — Etienne de Martel, Oct 07 '20 at 23:29
First, the encoding of the incoming file needs to be found. The combination present in the first 4 bytes can tell what encoding is present. There are many examples out there. StreamReader does not detect the correct Encoding. It just uses the default one if nothing is specified. Once the encoding is known, you can use that in StreamReader. — , Oct 08 '20 at 00:01
@GargiD.Chakravarty: _"combination present in the first 4 bytes can tell what encoding is present"_ -- incorrect. UTF16 BOM is 2 bytes. An unofficial "BOM" is often used in UTF8, and that's 3 bytes. If either are present, `StreamReader` _will_ correctly detect the encoding (so _"StreamReader does not detect the correct Encoding"_ is also incorrect). If neither are present, `StreamReader` will default to UTF8; if the file is not actually UTF8, it's up to the caller to explicitly provide the correct encoding. — Peter Duniho, Oct 08 '20 at 00:35
@EtiennedeMartel: _"documentation states that StreamReader always defaults to UTF-8"_ -- the documentation is misleading. Yes, the default encoding used is UTF8. But `StreamReader` will respect the encoding specified by a UTF16 or UTF8 BOM, if found, unless a constructor with the `detectEncodingFromByteOrderMarks` parameter is used, and `false` is passed for the value of that parameter. — Peter Duniho, Oct 08 '20 at 00:45
_"the StreamReader is interpreting this file as encoded in UTF-8"_ -- the only way for `StreamReader` to correctly decode a UTF16 input is if either a) the input includes a UTF16 BOM as the first two bytes, or b) you explicitly tell it by passing `Encoding.Unicode` to the constructor. If neither of these applies, you will get incorrect results. The only way to change that is to fix your code so that either the input is formed correctly for your code (i.e. has the BOM), or you are explicit about which encoding to use when you create the `StreamReader`. See the duplicate for more details. — Peter Duniho, Oct 08 '20 at 00:52
@PeterDuniho: It's true, the UTF-16 BOM is only 2 bytes, but Gargi and I were talking about 4 bytes because it can be *up to* that many. The code looks for `FF FE` but then looks for `00 00` in the next two bytes in case it could be UTF-32 LE. If it doesn't find that, it's UTF-16 LE. So the first 4 bytes *can* tell what encoding is present, even if not all 4 bytes are part of what indicates it. — madreflection, Oct 08 '20 at 06:05
@madreflection: _"looks for 00 00 in the next two bytes in case it could be UTF-32 LE"_ -- I see what you mean. But BOM in UTF32 is practically never used, because you can't actually distinguish between that and a UTF16 file that has `'\u0000'` as its first character (a perfectly legal construction). .NET will assume it's UTF32, but that's not necessarily going to be correct. And of course, for a question that concerns only UTF16 and UTF8, only the first three bytes at most are relevant. — Peter Duniho, Oct 09 '20 at 03:13
Hi, thank you all for your help and comments. In the end I discovered that the files that I am receiving are encoded as Unicode. I decided to use the code in this answer: https://stackoverflow.com/a/19283954/839211 to try to guess the encoding. If it is not successful, I fall back to Unicode. — Felipe, Oct 14 '20 at 10:01
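
The approach discussed in the comments above (check for a BOM first, otherwise guess, then pass an explicit `Encoding` to the `StreamReader` constructor) could be sketched roughly as follows. This is not the code from the linked answer. `EncodingGuesser`, `Guess` and `Decode` are hypothetical names, and the BOM-less branch is only a heuristic that assumes mostly ASCII content such as a typical CSV file: it treats a buffer whose zero bytes all fall on odd offsets as UTF-16 LE, which matches the "\0 after each char" symptom in the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingGuesser
{
    // Hypothetical helper: picks an encoding for a raw byte buffer.
    public static Encoding Guess(byte[] bytes)
    {
        // BOM checks. UTF-32 LE is tested before UTF-16 LE because its
        // BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
        if (bytes.Length >= 4 && bytes[0] == 0xFF && bytes[1] == 0xFE && bytes[2] == 0x00 && bytes[3] == 0x00)
            return new UTF32Encoding(bigEndian: false, byteOrderMark: true);
        if (bytes.Length >= 4 && bytes[0] == 0x00 && bytes[1] == 0x00 && bytes[2] == 0xFE && bytes[3] == 0xFF)
            return new UTF32Encoding(bigEndian: true, byteOrderMark: true);
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;           // UTF-16 LE
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;  // UTF-16 BE

        // No BOM. Assumption: in mostly-ASCII UTF-16 text, one byte of
        // every 2-byte unit is zero. Count zero bytes by position parity
        // over an even-sized sample of the buffer.
        int sample = Math.Min(bytes.Length, 1024) & ~1;
        int evenNuls = 0, oddNuls = 0;
        for (int i = 0; i < sample; i++)
        {
            if (bytes[i] != 0) continue;
            if ((i & 1) == 0) evenNuls++; else oddNuls++;
        }

        int pairs = sample / 2;
        if (pairs > 0 && evenNuls == 0 && oddNuls * 2 > pairs)
            return Encoding.Unicode;           // zeros only in the high byte: UTF-16 LE
        if (pairs > 0 && oddNuls == 0 && evenNuls * 2 > pairs)
            return Encoding.BigEndianUnicode;  // zeros only in the leading byte: UTF-16 BE

        return Encoding.UTF8;                  // same default StreamReader uses
    }

    // Decodes the buffer with the guessed encoding passed explicitly.
    public static string Decode(byte[] bytes)
    {
        Encoding encoding = Guess(bytes);
        using var stream = new MemoryStream(bytes);
        using var reader = new StreamReader(stream, encoding, detectEncodingFromByteOrderMarks: true);
        return reader.ReadToEnd();
    }
}
```

In the question's loop, `reader.ReadToEnd()` would be replaced by something like `EncodingGuesser.Decode(data.FileContent)`. The heuristic will misfire on UTF-16 text that is mostly non-ASCII, so if the sender is known to always use one encoding, passing that encoding directly is the safer choice.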

score -1 · Answer 1 · answered Oct 07 '20 at 23:21

You might be able to use this method

File.ReadAllText(string path, System.Text.Encoding encoding)

Based on the documentation, it tries to figure this out automatically. The text below is from the documentation:

This method opens a file, reads all the text in the file, and returns it as a string. It then closes the file.

This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.

The file handle is guaranteed to be closed by this method, even if exceptions are raised.

To use the encoding settings as configured for your operating system, specify the Encoding.Default property for the encoding parameter.

The full document can be found here

`ReadAllText()` follows the exact same heuristic as `StreamReader`. This answer is not in any way a solution to the question that was asked. — Peter Duniho, Oct 08 '20 at 00:33
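
A small experiment along the following lines can make the comment above concrete. It is only a sketch: `ReadAllTextDemo` and `RoundTrip` are hypothetical names, and the sample content `"a,b"` is made up for illustration. The same UTF-16 LE bytes are written to a temporary file once without and once with a BOM, then read back using `File.ReadAllText(path)`.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadAllTextDemo
{
    // Hypothetical helper: writes raw bytes to a temp file, reads them
    // back with File.ReadAllText, and deletes the file afterwards.
    public static string RoundTrip(byte[] bytes)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, bytes);
            // Detection here relies on a BOM; without one, UTF-8 is assumed.
            return File.ReadAllText(path);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // Encoding.GetBytes does not emit a BOM, so this is BOM-less UTF-16 LE.
        byte[] noBom = Encoding.Unicode.GetBytes("a,b");

        // Same bytes, prefixed with the UTF-16 LE BOM (FF FE).
        byte[] withBom = new byte[noBom.Length + 2];
        withBom[0] = 0xFF;
        withBom[1] = 0xFE;
        noBom.CopyTo(withBom, 2);

        string fromNoBom = RoundTrip(noBom);
        string fromWithBom = RoundTrip(withBom);

        // Without a BOM the bytes are decoded as UTF-8, so every 0x00
        // byte survives as a '\0' character in the resulting string.
        Console.WriteLine(fromNoBom.Contains('\0'));
        Console.WriteLine(fromWithBom == "a,b");
    }
}
```

The BOM-less input comes back with embedded `'\0'` characters, which is the same symptom described in the question, so switching to `ReadAllText` does not help unless the correct encoding is passed explicitly.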
