I'm currently trying to educate myself about the different encoding types. I wrote a simple console app to show me the differences between them.

byte[] byteArray = new byte[] { 125, 126, 127, 128, 129, 130, 250, 254, 255 };
string s = Encoding.Default.GetString(byteArray);
Console.OutputEncoding = Encoding.Default;
Console.WriteLine("Default: " + s);

s = Encoding.ASCII.GetString(byteArray);
Console.OutputEncoding = Encoding.ASCII;
Console.WriteLine("ASCII: " + s);

s = Encoding.UTF8.GetString(byteArray);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("UTF8: " + s);

The output however is nothing like I expected it to be.

Default: }~€‚úûüýþÿ
ASCII: }~?????????
UTF8: }~���������

Hmm... the characters do not copy well from the console output to here either, so here's a screenshot.

Console output printscreen

What I expect to see is the extended ASCII characters. The default encoding is almost correct, but it cannot display 251, 252 and 253. That might be a shortcoming of Console.WriteLine(), though I would not expect that.

The representation of the variable when debugging is as follows:

Default encoded string = "}~€‚úûüýþÿ"
ASCII encoded string = "}~?????????"
UTF8 encoded string = "}~���������"

Can someone tell me what I'm doing wrong? I expect one of the encoding types to properly display the extended ASCII table but apparently none can...

A bit of context:
I am trying to determine which encoding would be best as a standard in our company. I personally think UTF8 will do, but my supervisor would like to see some examples before we decide.

Obviously we know we will need to use other encoding types every now and then (serial communication, for example, uses 7 bits, so we can't use UTF8 there), but in general we would like to stick with one encoding type. Currently we are using Default, ASCII and UTF8 at random, so that's not a good thing.

EDIT
The output according to:

Console.WriteLine("Default: {0} for {1}", s, Console.OutputEncoding.CodePage);

output with code page

Edit 2:
Since I thought there might not be an encoding in which the extended ASCII characters correspond to the decimal numbers in the table I linked to, I turned it around. This:

char specialChar = '√';
int charNumber = (int)specialChar;

gives me the number 8730, which is 251 in the table.

Vincent
  • What is the code page of your console? Look in the Properties dialog. Note that there's no one encoding called "extended ASCII" - there are lots of *different* 8-bit encodings which share ASCII for the first 128 values. (And yes, UTF-8 is almost certainly the best choice to standardize on.) – Jon Skeet Nov 02 '15 at 08:24
  • You should separate two concerns here, by the way: a) what characters you can print on your console; b) what characters are in your strings. You can determine the latter *without* the former... see http://csharpindepth.com/Articles/General/Strings.aspx for sample code. – Jon Skeet Nov 02 '15 at 08:28
  • @JonSkeet Good question, I thought `Console.OutputEncoding = Encoding.ASCII;` would give me the right code page. – Vincent Nov 02 '15 at 08:36
  • No, that's changing how you're converting bytes to a string - it doesn't affect what the console is capable of displaying. – Jon Skeet Nov 02 '15 at 08:37
  • You click in the top left corner, then choose "Properties" - that's why I said to look in the Properties dialog. See also http://stackoverflow.com/questions/388490 – Jon Skeet Nov 02 '15 at 08:43
  • @JonSkeet Well, I cannot find a code page, but I've used this code: `int lcid = GetSystemDefaultLCID(); var ci = System.Globalization.CultureInfo.GetCultureInfo(lcid); var page = ci.TextInfo.OEMCodePage;` which gives me 850. – Vincent Nov 02 '15 at 08:50
  • I'm not talking about looking in code. You have a console window open, right? Click on the top left of it, and select the Properties menu item. I'd expect to see a code page there. (If not, which version of Windows are you using?) – Jon Skeet Nov 02 '15 at 08:51
  • @JonSkeet I'm using Windows 7 Professional (Dutch version). If I go to the properties of my cmd window (the one that opens when running the application), I cannot see anything that resembles a code page. I have the tabs 'Options', 'Font type', 'Layout' and 'Colors'; none of those have a code page. Using the chcp command in cmd.exe gives me 850 as well. – Vincent Nov 02 '15 at 08:57
  • Hmm. Maybe it's a Windows 8+ feature then. Try using `chcp 65001` to change it to UTF-8 then... – Jon Skeet Nov 02 '15 at 09:04
  • Try `Console.WriteLine("Default: {0} for {1}", s, Console.OutputEncoding.HeaderName);` to see actual output encoding. – Wernfried Domscheit Nov 02 '15 at 09:07
  • @JonSkeet That's what `Console.OutputEncoding = Encoding.UTF8;` should do. Also, `chcp 65001` works until I close the cmd window; then it resets to 850. – Vincent Nov 02 '15 at 09:26
  • @WernfriedDomscheit I've added a picture of the console output, thanks for the easy way to check it. – Vincent Nov 02 '15 at 09:28
  • @VincentAdvocaat: Ah, had missed that part of `Console.OutputEncoding`. Hmm. – Jon Skeet Nov 02 '15 at 09:50

2 Answers

The output encoding in your case should be mostly irrelevant since you're not even working with Unicode. Furthermore, you need to change your console window settings from Raster fonts to a TrueType font, like Lucida Console or Consolas. When the console is set to raster fonts, you can only have the OEM encoding (CP850 in your case), which means Unicode doesn't really work at all.

However, all that is moot as well, since your code is ... weird, at best. First, as to what is happening here: You have a byte array, interpret that in various encodings and get a (Unicode) string back. When writing that string to the console, the Unicode characters are converted to their closest equivalent in the codepage of the console (850 here). If there is no equivalent, not even close, then you'll get a question mark ?. This happens most prominently with ASCII and characters above 127, because they simply don't exist in ASCII.
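To check what each decoding step actually produced, independent of what the console font or code page can draw, one option is to print the code points as numbers. This is a minimal sketch, not part of the original question; the `CodePointDump` class and `Dump` helper are hypothetical names introduced for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

static class CodePointDump
{
    // Hypothetical helper: formats each UTF-16 code unit of a string as U+XXXX,
    // so the result does not depend on the console's code page or font.
    public static string Dump(string s) =>
        string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));

    static void Main()
    {
        byte[] byteArray = { 125, 126, 127, 128, 129, 130, 250, 254, 255 };

        // ASCII has no values above 127, so those bytes decode to '?' (U+003F).
        Console.WriteLine("ASCII: " + Dump(Encoding.ASCII.GetString(byteArray)));

        // None of the bytes above 127 form a valid UTF-8 sequence here,
        // so the decoder substitutes the replacement character (U+FFFD).
        Console.WriteLine("UTF8:  " + Dump(Encoding.UTF8.GetString(byteArray)));
    }
}
```

Because only ASCII digits and letters are written, the output looks the same under any console code page, which separates "what is in the string" from "what the console can display".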

If you want to see particular characters, then either use the correct encodings throughout instead of meddling around until it somewhat works, or just use the right characters to begin with.

Console.WriteLine("√ⁿ²");

should actually work because it runs through the encoding translation processes described above.
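As an illustration of "using correct encodings throughout", the sketch below decodes bytes 251 to 253 with an explicitly chosen code page and converts back again. It rests on two assumptions: the table in the question appears to be code page 437 (that is where 251 maps to √), and the runtime is a modern .NET where legacy code pages must be registered via `CodePagesEncodingProvider`. On the .NET Framework of the time, `Encoding.GetEncoding(437)` should work without the registration line. The `CodePageRoundTrip` class and `ForCodePage` helper are hypothetical names.

```csharp
using System;
using System.Text;

static class CodePageRoundTrip
{
    // Hypothetical helper: returns the encoding for a legacy code page number.
    public static Encoding ForCodePage(int codePage)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // are opt-in. Registering the provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage);
    }

    static void Main()
    {
        Encoding cp437 = ForCodePage(437);

        // Bytes -> string: interpret 251, 252, 253 as code page 437.
        string s = cp437.GetString(new byte[] { 251, 252, 253 });
        foreach (char c in s)
        {
            Console.Write("U+{0:X4} ", (int)c);   // U+221A U+207F U+00B2
        }
        Console.WriteLine();

        // String -> bytes: the square root sign maps back to 251 in code page 437.
        byte[] back = cp437.GetBytes("\u221A");
        Console.WriteLine(back[0]);               // 251
    }
}
```

The same bytes decoded with a different code page (850 or 1252, for example) give different characters, which is why the encoding has to be named explicitly rather than left to `Encoding.Default`.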

Joey
  • When setting `Console.OutputEncoding = Encoding.Default;` and the font type to Lucida Console, I do get the correct output. And the byte values of the extended ASCII table I linked to are not for UTF-8; UTF-8 has different numbers for the extended ASCII characters. So I could perfectly well save anything in the UTF-8 format, I just need to be careful not to pass the 254 byte values to a UTF-8 encoding and expect extended ASCII. – Vincent Nov 02 '15 at 10:18

Strange, with this code

Console.OutputEncoding = Encoding.Default;
Console.WriteLine("Default: {0} for {1}", s, Console.OutputEncoding.HeaderName);
s = Encoding.ASCII.GetString(byteArray);
Console.OutputEncoding = Encoding.ASCII;
Console.WriteLine("ASCII: {0} for {1}", s, Console.OutputEncoding.HeaderName);
s = Encoding.UTF8.GetString(byteArray);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("UTF8: {0} for {1}", s, Console.OutputEncoding.HeaderName);

I get this one:

Default: }~€‚úþÿ for Windows-1252
ASCII: }~?????? for us-ascii
UTF8: }~ ������ for utf-8

This is what I would expect. The default code page is CP1252, not CP850, which is what your table shows. Try another font for your console, e.g. "Consolas" or "Lucida Console", and check the output.

Wernfried Domscheit