I am having some issues with the default string encoding in C#. I need to read strings from certain files/packets. However, these strings include characters from the 128-255 range (extended ASCII), and all of these characters show up as question marks instead of the proper character. For example, a string containing extended ASCII characters could come out as "S?meStr?n?".

Now, is there any way to change the default encoding for my application? I know that in Java you can define the default character set from the command line.

Dan
  • I've actually found out the issue. I wasn't too familiar with C#'s encoding feature. I've edited my Packet/File reading classes from Encoding.ASCII to Encoding.Default, and it actually seems to be reading the strings correctly now (from the packets at least). – Dan Jul 01 '11 at 16:10
  • Don't use Encoding.Default - it can change between machines and your code will not work properly (check out http://www.joelonsoftware.com/articles/Unicode.html in addition to Jon's and Sean's answers) – Alexei Levenkov Jul 01 '11 at 16:16
  • The problem may be in your viewer (Webpage, WPF application, etc). How are you displaying the text? Can you post some examples, please? – Kristofer Hoch Jul 01 '11 at 15:56

2 Answers

There's no one single "extended ASCII" encoding. There are lots of different 8-bit encodings which are compatible with ASCII for the bottom 128 values.

You need to find out what encoding your files actually use, and specify that when reading the data with StreamReader (or whatever else you're using). For example, you may want the Windows-1252 encoding:

Encoding encoding = Encoding.GetEncoding(1252);
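
As a rough sketch of how that encoding could be passed to StreamReader, the snippet below decodes a few made-up bytes (they are not taken from the question's files) and compares the result with what Encoding.ASCII produces. The class and method names are hypothetical. It also assumes a runtime where CodePagesEncodingProvider is available: on .NET Core and .NET 5+ code page 1252 has to be registered first, while on .NET Framework it is built in and the registration line is unnecessary.

```csharp
using System;
using System.IO;
using System.Text;

public static class Cp1252ReaderSketch
{
    // Decodes everything in the stream as Windows-1252 text via StreamReader.
    public static string ReadAll(Stream stream)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is only available once this provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(1252);

        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Made-up sample data: 'S', 0xF6 (ö), 'm', 0xE9 (é) in Windows-1252.
        byte[] raw = { 0x53, 0xF6, 0x6D, 0xE9 };

        string good = ReadAll(new MemoryStream(raw));
        string bad = Encoding.ASCII.GetString(raw); // bytes above 0x7F become '?'

        Console.WriteLine(good == "S\u00F6m\u00E9"); // True
        Console.WriteLine(bad == "S?m?");            // True
    }
}
```

The second comparison reproduces the symptom from the question: ASCII has no mapping for bytes above 0x7F, so each one is replaced with a question mark.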

.NET strings are always sequences of UTF-16 code units. You can't change that, and you shouldn't try. (That's true in Java as well, and you really shouldn't use the platform default encoding when calling getBytes() etc. unless that's what you really, really mean.)
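
To illustrate the point, here is a small sketch (the class and helper names are made up) showing that an encoding only comes into play when a string is converted to or from bytes. The same one-character string yields different byte sequences depending on which Encoding is asked to do the conversion.

```csharp
using System;
using System.Text;

public static class Utf16Sketch
{
    // Hypothetical helper: encodes text and returns the bytes as hex, e.g. "C3-B6".
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string s = "\u00F6"; // 'ö', a single UTF-16 code unit in memory

        Console.WriteLine(Hex(s, Encoding.GetEncoding("iso-8859-1"))); // F6
        Console.WriteLine(Hex(s, Encoding.UTF8));                      // C3-B6
        Console.WriteLine(Hex(s, Encoding.Unicode));                   // F6-00
    }
}
```

The string itself never changes; only the bytes written to or read from a file or packet depend on the chosen encoding.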

Jon Skeet
  • I have run up against this a few times while reading files from my European counterparts. The solution was, "please tell me how you are encoding your files" and then I can use the right encoding when reading the file. Usually, I end up putting the encoding in a config setting so that if they change we can still read in the file without a code change. – Jon Raynor Jul 01 '11 at 16:00
  • You should ask them to use UTF-8 unless they can give a good reason why they can't. – MRAB Jul 01 '11 at 17:01
  • Is it not even possible to create a simple ASCII (range of 0-127 characters) string constant? I'm asking this because I took a peek at my compiled C# code with TotalCommander and I thought to myself that it would be interesting to store it as an ASCII string in there. – Adam L. S. Apr 22 '12 at 09:04
  • @AdamL.S.: It's not clear what you mean. You can declare a string constant: `const string Foo = "XYZ";` - that's fine for any string. This question is about encoding - it's not clear how your comment relates to it. – Jon Skeet Apr 22 '12 at 09:07
  • @Jon Skeet: My comment relates to how it is stored inside the compiled *.exe file. So if I write `const string foo = "bar";` the “bar” string will appear in the binary as `b a r `, in UTF-16. (This can be checked with a hex editor or Total Commander Viewer.) However, I found out that if I store it as a string resource, it will be stored as a UTF-8 string instead. How it is stored is completely irrelevant, but it is an interesting thing to toy with, in my opinion. – Adam L. S. Apr 23 '12 at 13:44
  • @AdamL.S.: Okay - I'd actually expected that it would be in UTF-8, but I certainly wouldn't change any code on that front anyway. – Jon Skeet Apr 23 '12 at 14:01
The functions for reading text generally have at least one overload that accepts an Encoding - for example, File.ReadAllText(string, Encoding).

So if you know a file is encoded using Windows-1252, then you can specify it like so:

string contents = File.ReadAllText(someFilePath, Encoding.GetEncoding(1252));

Of course, doing this requires knowing ahead of time which code page is being used.
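
One way to cope with not knowing the code page at compile time, loosely following the config-setting idea mentioned in a comment on the other answer, is to pass the encoding name in as a value. The sketch below is only an illustration: the class name, the method name, and the `encodingName` parameter are hypothetical, the sample bytes are made up, and the provider registration assumes .NET Core or .NET 5+ (on .NET Framework it is not needed).

```csharp
using System;
using System.IO;
using System.Text;

public static class ConfiguredEncodingSketch
{
    // encodingName is a hypothetical setting value such as "windows-1252" or "utf-8".
    public static string ReadWithConfiguredEncoding(string path, string encodingName)
    {
        // Assumption: .NET Core / .NET 5+, where legacy code pages must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return File.ReadAllText(path, Encoding.GetEncoding(encodingName));
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Made-up sample file: "Sömé" written as Windows-1252 bytes.
            File.WriteAllBytes(path, new byte[] { 0x53, 0xF6, 0x6D, 0xE9 });

            string contents = ReadWithConfiguredEncoding(path, "windows-1252");
            Console.WriteLine(contents == "S\u00F6m\u00E9"); // True
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If the sender later switches to another encoding, only the setting value would need to change, not the reading code.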

Sean U