I'm a little bit confused about the file encoding. I want to change it. Here is my code:

public class ChangeFileEncoding
{
    private const int BUFFER_SIZE = 15000;

    public static void ChangeEncoding(string source, Encoding destinationEncoding)
    {
        var currentEncoding = GetFileEncoding(source);
        string destination = Path.GetDirectoryName(source) + @"\" + Guid.NewGuid().ToString() + Path.GetExtension(source);
        using (var reader = new StreamReader(source, currentEncoding))
        {
            using (var writer = new StreamWriter(File.OpenWrite(destination), destinationEncoding))
            {
                char[] buffer = new char[BUFFER_SIZE];
                int charsRead;
                while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    writer.Write(buffer, 0, charsRead);
                }
            }
        }
        File.Delete(source);
        File.Move(destination, source);
    }

    public static Encoding GetFileEncoding(string srcFile)
    {
        using (var reader = new StreamReader(srcFile))
        {
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}

And in the Program.cs I have the code:

    string file = @"D:\path\test.txt";
    Console.WriteLine(ChangeFileEncoding.GetFileEncoding(file).EncodingName);
    ChangeFileEncoding.ChangeEncoding(file, new System.Text.ASCIIEncoding());
    Console.WriteLine(ChangeFileEncoding.GetFileEncoding(file).EncodingName);

And the text printed in my console is:

Unicode (UTF-8)

Unicode (UTF-8)

Why is the file's encoding not changed? Am I doing something wrong when changing the file's encoding?

Regards

Buda Gavril

2 Answers


The StreamReader class, when not passed an Encoding in its constructor, will try to automatically detect the encoding of a file. It will do so just fine when the file starts with a BOM (and you should write the preamble when changing the encoding of a file to facilitate this the next time you want to read the file).
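The preamble advice can be sketched as follows (a minimal standalone example; the temp file name is just for illustration): passing `new UTF8Encoding(true)` to `StreamWriter` writes the EF BB BF bytes at the start of the file, so a later `StreamReader` can detect the encoding instead of having to assume it.

```csharp
using System;
using System.IO;
using System.Text;

class WriteBomDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "bom-demo.txt");

        // UTF8Encoding(true) tells StreamWriter to emit the UTF-8 preamble (BOM).
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(true)))
        {
            writer.Write("hello");
        }

        // The first three bytes of the file are now EF BB BF.
        byte[] bytes = File.ReadAllBytes(path);
        Console.WriteLine(bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF); // True

        // A StreamReader can now *detect* UTF-8 from the BOM instead of assuming it.
        using (var reader = new StreamReader(path, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek();
            Console.WriteLine(reader.CurrentEncoding.WebName); // utf-8
        }
        File.Delete(path);
    }
}
```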

Properly detecting the encoding of a text file is a Hard Problem, especially for non-Unicode files or Unicode files without a BOM. The reader (whether StreamReader, Notepad++ or any other reader) will have to guess which encoding is being used in the file.

See also How can I detect the encoding/codepage of a text file, emphasis mine:

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results.

Because ASCII (characters 0-127) is a subset of Unicode, it's safe to read an ASCII file with a byte-oriented Unicode encoding such as UTF-8. Hence the StreamReader reporting that encoding.

That is, as long as it's truly ASCII. Any character above code point 127 will be ANSI, and then you're into the fun of guessing the correct code page.

So to answer your question: you have changed the file's encoding; there simply is no fool-proof way to "detect" it, only to guess it.
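To see this in action, here is a small sketch (the temp file name is illustrative): a file written as ASCII contains no BOM, and since every ASCII byte is also valid UTF-8, `StreamReader` simply falls back to its UTF-8 default, which is exactly what the question observed.

```csharp
using System;
using System.IO;
using System.Text;

class AsciiDetectDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "ascii-demo.txt");

        // Encoding.ASCII writes no preamble: one byte per character, no BOM.
        File.WriteAllText(path, "plain ASCII text", Encoding.ASCII);

        // With no BOM there is nothing to detect, so StreamReader keeps
        // its default UTF-8 encoding, even though we wrote ASCII.
        using (var reader = new StreamReader(path))
        {
            reader.Peek();
            Console.WriteLine(reader.CurrentEncoding.WebName); // utf-8
        }
        File.Delete(path);
    }
}
```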

Required reading material: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode, UTF, ASCII, ANSI format differences.

CodeCaster
  • but this is just one case; I will have files with different encodings and need to change their encoding to UTF-8, UTF-8 without BOM, etc... – Buda Gavril Mar 11 '16 at 10:20
  • Yeah and that works just fine. StreamReader does encoding detection, and without a BOM, it assumes UTF-8 unless specified otherwise. – CodeCaster Mar 11 '16 at 10:23
  • so if I set the writer encoding to ASCII, why is the second message in the console still UTF-8? I think the second message should be US-ASCII. If I open the file in Notepad++, the encoding menu shows "Encode in ANSI" selected – Buda Gavril Mar 11 '16 at 10:27
  • Like I said, because ASCII is a subset of Unicode, it's safe to read an ASCII file with an UTF-8 encoding. – CodeCaster Mar 11 '16 at 10:32
  • @BudaGavril And Notepad++ is just guessing, too. There are always valid guesses. It's up to you to set it correctly if you edit the file. It is always the writer that determines the encoding, and it's the sender's responsibility to communicate it to the reader. – Tom Blodget Mar 12 '16 at 00:01

Detecting via StreamReader.CurrentEncoding is a bit tricky, since it doesn't tell you what encoding the file uses, but what encoding the StreamReader is using to read it. Basically, if there is no BOM, there's no easy way to detect the encoding without reading the whole file (and analyzing what you find there, which is not trivial).

For files with a BOM, it's easy:

public static Encoding GetFileEncoding(string srcFile)
{
   var bom = new byte[4];
   int read;
   using (var f = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
   {
      read = f.Read(bom, 0, 4);
   }

   // Check the 4-byte BOMs first, so a UTF-32 LE file (FF FE 00 00)
   // isn't mistaken for UTF-16 LE (FF FE).
   if (read >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32 LE
   if (read >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32 BE
   if (read >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
   if (read >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
   if (read >= 2 && bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode;          // UTF-16 LE
   if (read >= 2 && bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16 BE
   // No BOM, so you choose what to return... the usual would be UTF-8 or ASCII
   return Encoding.UTF8;
}
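A quick way to exercise the BOM sniffing above (a standalone sketch; the temp file name and the trimmed-down helper are just for illustration): writing a file as UTF-16 LE puts the FF FE preamble at the start, which the sniffer maps to `Encoding.Unicode`.

```csharp
using System;
using System.IO;
using System.Text;

class BomDetectDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "utf16-demo.txt");

        // Encoding.Unicode (UTF-16 LE) writes its FF FE preamble via File.WriteAllText.
        File.WriteAllText(path, "hello", Encoding.Unicode);

        Console.WriteLine(GetFileEncoding(path).WebName); // utf-16
        File.Delete(path);
    }

    // Same BOM-sniffing idea as the answer, trimmed to the UTF-8/UTF-16 cases.
    static Encoding GetFileEncoding(string srcFile)
    {
        var bom = new byte[4];
        using (var f = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
        {
            f.Read(bom, 0, 4);
        }
        if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
        if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode;
        if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode;
        return Encoding.UTF8; // no BOM: assume UTF-8
    }
}
```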
Jcl