Text file encoding issue

Question

I found some existing questions about encoding issues before asking, but none of them covers my case. I currently have two methods that I would rather not modify:

//FileManager.cs
public byte[] LoadFile(string id);
public FileStream LoadFileStream(string id);

They work correctly for all kinds of files. Now I have the ID of a text file (it's guaranteed to be a .txt file) and I want to get its content. I tried the following:

byte[] data = manager.LoadFile(id);
string content = Encoding.UTF8.GetString(data);

But obviously this doesn't work for non-UTF-8 encodings. To resolve the encoding issue, I tried getting the file's FileStream first and then using a StreamReader:

public StreamReader(Stream stream, bool detectEncodingFromByteOrderMarks);

I hoped this overload would resolve the encoding, but I still get strange content.

using(var stream = manager.LoadFileStream(id))
using(var reader = new StreamReader(stream, true))
{
    content = reader.ReadToEnd();    //still incorrect
}

Maybe I misunderstood the usage of detectEncodingFromByteOrderMarks? How can I resolve the encoding issue?

What do you mean by "strange contents"? What is the content you get? What should it look like? Did you have a look at the file's first bytes using a hex editor? — PVitt, Oct 18 '11 at 08:27
You [can't detect the encoding](http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file); you must know it yourself. — Shadow The GPT Wizard, Oct 18 '11 at 08:27

C.Evenhuis · Accepted Answer · 2011-10-18T08:44:22.457

Byte order marks (BOMs) are sometimes added to files encoded in one of the Unicode formats to indicate whether characters made up of multiple bytes are stored in big-endian or little-endian order (is byte 1 stored first and then byte 0, or byte 0 first and then byte 1?). This is particularly relevant when files are exchanged between machines that store these multibyte values in opposite orders.
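To make the idea of a byte order mark concrete, here is a minimal sketch that prints the BOM bytes .NET associates with a few of its built-in Unicode encodings. The `BomTable` class and its `Describe` method are hypothetical names introduced only for this illustration; `Encoding.GetPreamble` and `BitConverter.ToString` are the framework calls doing the work.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class BomTable
{
    // Returns the BOM (preamble) bytes of an encoding as a hex string.
    public static string Describe(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }
}

// BomTable.Describe(Encoding.UTF8)             -> "EF-BB-BF"
// BomTable.Describe(Encoding.Unicode)          -> "FF-FE"  (UTF-16, little endian)
// BomTable.Describe(Encoding.BigEndianUnicode) -> "FE-FF"  (UTF-16, big endian)
```

Note how the two UTF-16 marks are the same two bytes in opposite order, which is exactly what lets a reader tell the byte order apart.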

If you read a file and its first few bytes match a byte order mark, chances are quite high that the file is encoded in the Unicode format that corresponds to that mark. You never know for sure, though, as Shadow Wizard mentioned. Since it's always a guess, the option is provided as a parameter.

If there is no ByteOrderMark in the first bytes of the file, it'll be hard to guess the file's encoding.
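One way to handle both cases, sketched here under the assumption that you know (or are willing to guess) which encoding BOM-less files use: pass a fallback encoding to the `StreamReader(Stream, Encoding, bool)` overload. The reader then uses the BOM when one is present and the fallback otherwise. The `TextDecoder` class and `Decode` method are hypothetical names, and the choice of fallback is an assumption the caller has to supply; it cannot be derived from the file.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class TextDecoder
{
    // Decodes the bytes using the BOM if one is present; otherwise uses
    // the caller-supplied fallback encoding (an assumption, not a detection).
    public static string Decode(byte[] data, Encoding fallback)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the methods from the question, this could be called as `TextDecoder.Decode(manager.LoadFile(id), Encoding.GetEncoding("iso-8859-1"))`, where ISO-8859-1 stands in for whatever encoding your BOM-less files actually use. If that guess is wrong, the content will still come out garbled.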

More info: http://en.wikipedia.org/wiki/Byte_order_mark
