
I'm writing a TFS check-in policy, which checks whether our source files contain our file header.

My problem is that our file header contains a special character "©", and unfortunately some of our source files are encoded in ANSI. So when I read these files in the policy, the string looks like this: "Copyright � 2009".

string content = File.ReadAllText(pendingChange.LocalItem);

I tried to change the encoding of the string, but it does not help. So how can I read these files so that I get the correct string "Copyright © 2009"?

Enyra

3 Answers


Use Encoding.Default:

string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);

You should be aware, however, that this reads the file using the system default encoding, which may not be the same as the encoding of the file. There's no single encoding called "ANSI", but usually when people talk about "the ANSI encoding" they mean Windows code page 1252, or whatever their box happens to use.

Your code will be more robust if you can find out the exact encoding used.

Jon Skeet
  • I found out the encoding with the preamble of the encodings; afterwards it works fine, thanks. – Enyra Sep 16 '09 at 12:50
  • Thanks for the _system default encoding_ hint, expected 1252 while it was actually UTF-8. – Onkel Toob Oct 30 '18 at 11:49
  • This is no longer valid in .net 5.0 – T-moty Jan 15 '21 at 09:41
  • @T-moty: Um, what do you mean by "no longer valid"? Do you mean `Encoding.Default` has a different meaning in .NET 5.0 compared with .NET Framework 4.x? (It's unfortunate that [the documentation](https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.default?view=net-5.0) lists the behavior on .NET Framework and .NET Core, but not .NET 5.0, but just saying "no longer valid" really doesn't give enough information for the comment to be useful IMO.) – Jon Skeet Jan 15 '21 at 10:27
  • Sorry @JonSkeet, I didn't realize I was commenting on your answer (you are pretty darn famous here :-)). For completion on my previous comment: the property `System.Text.Encoding.Default`, from .NET Core 1.0, no longer returns the 1252 code page encoding; instead it yields `System.Text.UTF8Encoding.UTF8EncodingSealed`. As far as I know, this behavior applies only to the "core" frameworks and the fifth version (which is, underneath, Core). Have a nice weekend :-) – T-moty Jan 15 '21 at 16:43
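Enyra's preamble approach from the first comment can be sketched roughly as follows. This is a minimal illustration, not part of the answer; `EncodingSniffer.Detect` is a hypothetical helper name, and only the encodings with well-known preambles are checked:

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingSniffer
{
    // Compares the first bytes of the file against the preambles (BOMs)
    // of the common Unicode encodings; returns the fallback when none matches.
    public static Encoding Detect(string path, Encoding fallback)
    {
        byte[] start = new byte[4];
        int read;
        using (FileStream fs = File.OpenRead(path))
            read = fs.Read(start, 0, start.Length);

        // Longer preambles first: UTF-32 LE (FF FE 00 00) would otherwise
        // be mistaken for UTF-16 LE (FF FE).
        Encoding[] candidates =
        {
            Encoding.UTF32,            // FF FE 00 00
            Encoding.UTF8,             // EF BB BF
            Encoding.Unicode,          // FF FE (UTF-16 LE)
            Encoding.BigEndianUnicode  // FE FF (UTF-16 BE)
        };

        foreach (Encoding enc in candidates)
        {
            byte[] preamble = enc.GetPreamble();
            if (preamble.Length == 0 || read < preamble.Length)
                continue;

            bool match = true;
            for (int i = 0; i < preamble.Length; i++)
                if (start[i] != preamble[i]) { match = false; break; }

            if (match)
                return enc;
        }
        return fallback;
    }
}
```

Files without any BOM still fall through to the fallback, so this only narrows the guesswork; it cannot distinguish BOM-less UTF-8 from an ANSI code page.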

It would seem sensible, if you are going to have such policies, that you would also have a team-agreed standard encoding. To be honest, I can't see why any team would use an encoding other than "Unicode (UTF-8 with signature) - Codepage 65001" (except perhaps for ASPX pages with significant non-Latin static content, but even then I can't see how it would be a big deal to use UTF-8).

Assuming you still want to allow mixed encodings, you next need a way to determine which encoding a file was saved in, so you know which encoding to pass to ReadAllText. It's not easy to determine this from the file; however, using Encoding.Default is likely to work OK, since most likely you have just two encodings to deal with: the VS one (UTF-8 with signature) and a common ANSI encoding used by your machines (probably Windows-1252).

Hence using

 string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);

will work. (As I see Jon has already posted). This works because when the UTF-8 BOM (which is what VS means by the term "signature") is present at the start of the file the supplied encoding parameter is ignored and UTF-8 is used anyway. Hence where the file is saved using UTF-8 you get correct results and where ANSI is used you are most likely also to get correct results.
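The behavior described above can be demonstrated quickly (a sketch; the temp-file path is just for illustration):

```csharp
using System;
using System.IO;
using System.Text;

string path = Path.GetTempFileName();

// new UTF8Encoding(true) writes the EF BB BF signature ("BOM") first,
// which is what Visual Studio does for "UTF-8 with signature".
File.WriteAllText(path, "Copyright © 2009", new UTF8Encoding(true));

// Although we pass Encoding.Default, ReadAllText spots the BOM and
// decodes the file as UTF-8 anyway, so the © survives intact.
string content = File.ReadAllText(path, Encoding.Default);

Console.WriteLine(content); // "Copyright © 2009"
```

So the supplied encoding only matters for files that carry no signature at all, which is exactly the ANSI case.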

BTW, if you are processing file headers, wouldn't ReadAllLines make things easier?

AnthonyWJones
  • Your solution of just using Encoding.Default would fail, though, if the input was a UTF-8 file that didn't have a BOM (as not all UTF-8 files come with BOMs, of course). – Dan W Oct 10 '12 at 06:26
  • Thanks for pointing out that even when using "Encoding.Default", if a BOM is found at the beginning of the file it will fall back to UTF8. This saved my day. – carlos357 Jun 24 '16 at 09:46
  • Important note from answer: supplied encoding parameter IS IGNORED if BOM is found at the beginning of the file. This was driving me nuts last night. And thanks @carlos357 for the comment, it gave me the idea to check for that this morning. I have a text file with a BOM which is actually ANSI/1252 encoded. Bad data from a vendor. I got around this problem by using File.ReadAllBytes, converting the byte[] to a string using Encoding.GetEncoding(1252).GetString, and then trimming off the BOM. – poprogrammer Feb 22 '17 at 16:34
  • How do you trim off the BOM? – Kiquenet May 29 '20 at 09:29
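The trimming poprogrammer describes could look something like this (a sketch, assuming the usual EF BB BF UTF-8 BOM bytes; `filename` is a placeholder, and on .NET Core/5+ code page 1252 additionally requires registering `CodePagesEncodingProvider`):

```csharp
using System.IO;
using System.Text;

// Needed on .NET Core/5+ for code page 1252 (System.Text.Encoding.CodePages):
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

byte[] raw = File.ReadAllBytes(filename);

// Skip a leading UTF-8 BOM (EF BB BF) if present, then decode the
// remainder as Windows-1252 regardless of what the BOM claims.
int offset = (raw.Length >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF)
    ? 3 : 0;
string content = Encoding.GetEncoding(1252).GetString(raw, offset, raw.Length - offset);
```

Decoding the bytes yourself is the only way to win here, because every `File.ReadAllText` overload honors the BOM before your encoding argument.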

I know this is an old question, but I ran into a similar situation and found the accepted answer to be cutting some corners (no disrespect to Jon Skeet's pragmatic short answer, but I'll flesh it out a little more)...

The specs state that the header will contain the encoding directly after {\rtf:

 \ansi  ANSI (the default)
 \mac   Apple Macintosh
 \pc    IBM PC code page 437 
 \pca   IBM PC code page 850, used by IBM Personal System/2 (not implemented in version 1 of Microsoft Word for OS/2)

According to Wikipedia the "ANSI character set has no well-defined meaning"

For the default ANSI you have the choice of these partially incompatible encodings:

 using System.Text;
 ...
 string content = File.ReadAllText(filename, Encoding.GetEncoding("ISO-8859-1"));

or

 string content = File.ReadAllText(filename, Encoding.GetEncoding("Windows-1252"));

Using WordPad on Windows 10 to save a file with a euro sign (0x80 in Windows-1252; ISO-8859-1 has no euro sign at all, though ISO-8859-15 puts one at 0xA4) revealed the following:

The header stated the exact encoding after \ansi

{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1043{ ...

And the encoding was not directly used, instead it was wrapped in RTF encoding: \'80

according to the specs:

\'hh : A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).

I guess the best thing to do is read the header, if the file starts with {\rtf1\ansi\ansicpg1252 then go for Windows-1252.

But to make things more complicated, the specs also state that there can be mixed encodings... search for '\upr'...

I guess there is no definitive answer. The easiest way to go in your case may be to search (in the un-decoded raw byte array) for all the variations of the encoded copyright sign that you may encounter in your source base.
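That raw-byte search could be sketched like this (an illustration, not from the answer; `filename` is a placeholder). Conveniently, "©" is 0xA9 in both Windows-1252 and ISO-8859-1, and its UTF-8 form C2 A9 also contains 0xA9, so one scan covers all three:

```csharp
using System.IO;
using System.Linq;

byte[] raw = File.ReadAllBytes(filename);

// 0xA9 is "©" in Windows-1252/ISO-8859-1 and the trailing byte of
// its UTF-8 encoding (C2 A9), so this catches every common variant.
// Note: an RTF writer may instead emit the ASCII escape \'a9, which
// this byte scan would miss.
bool hasCopyrightSign = raw.Contains((byte)0xA9);
```

This sidesteps decoding entirely, which is exactly the point: you never have to commit to an encoding just to check for the header.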

In my case I finally decided to cut a few corners as well, but add a small percentage of defensive coding. All files I have seen so far were Windows-1252 so I common-case-optimised for that.

    Encoding encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);

    using (System.IO.StreamReader reader = new System.IO.StreamReader(filename, encoding)) {
        string header = reader.ReadLine() ?? "";  // guard against an empty file
        if (!header.Contains("cpg1252")) {
            if (header.Contains("\\pca"))
                encoding = Encoding.GetEncoding(850, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
            else if (header.Contains("\\pc"))
                encoding = Encoding.GetEncoding(437, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
            else
                encoding = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
        }
    }

    string content = System.IO.File.ReadAllText(filename, encoding);
Louis Somers