
My C# program receives string data (via the Windows message queue) which sometimes includes char 133 in a string.

Is this a valid value in C#?

For example, if I do this:

string x = "a" + (char)133 + "b"; // 133 = 0x85

I can see the string x has length 3, but in the Visual Studio debugger I can only see x = "ab".

If I do the following, I get the ellipsis character (which I think is what the sending program means by 133):

string y = "a" + (char)8230 + "b"; // 8230 = 0x2026

Thanks for any pointers.

xdzgor
  • Your question shows that `(char)133` is a char in a string. What exactly is your question? – Martijn Jun 19 '15 at 08:39
  • Hi - thanks. I've investigated a little more, and begun to understand a little more about Unicode, UTF-8, and C# strings. You're correct that the string does indeed contain the character 0x85. This character is, though, a "control character" in Unicode (but an ellipsis character in extended ASCII). There is obviously a mismatch in the "program stack" I'm working with: some program is sending extended ASCII, which I can see is encoded as UTF-8 in the Windows message queue, and my program is reading it as Unicode. So no question! I just need to interpret the input correctly. – xdzgor Jun 19 '15 at 09:25
  • Please try to forget there ever was such a thing as ASCII, it will only get in your way. I don't mind explaining a couple of things in chat if you want, but the most important takeaway is *there is no such thing as "plain text"*. It simply doesn't exist. – Martijn Jun 19 '15 at 13:58

1 Answer


In a string there is no "invalid" value for a char. There are char sequences that are not valid Unicode (an unpaired surrogate, for example), but a string can contain them without problems, because string is a "stupid container" of UTF-16 code units. Note, though, that some string and encoding methods are "more intelligent" and don't much like invalid sequences. Normally they skip them or replace them with a substitution character.
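A minimal sketch of the "stupid container" behaviour, assuming the default `Encoding.UTF8` instance (which uses a replacement fallback). The class name `StringContainerDemo` and the helper `Utf8Hex` are made up for illustration:

```csharp
using System;
using System.Text;

static class StringContainerDemo
{
    // Hypothetical helper: UTF-8 bytes of s as hex, e.g. "61-C2-85-62".
    public static string Utf8Hex(string s) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(s));

    static void Main()
    {
        // U+0085 is stored like any other char.
        string x = "a" + (char)133 + "b";
        Console.WriteLine(x.Length);       // 3
        Console.WriteLine((int)x[1]);      // 133
        Console.WriteLine(Utf8Hex(x));     // 61-C2-85-62

        // A lone high surrogate is not valid Unicode text, yet the string holds it.
        string lone = "a" + '\uD800' + "b";
        Console.WriteLine(lone.Length);    // 3

        // The "more intelligent" encoder substitutes U+FFFD (bytes EF BF BD).
        Console.WriteLine(Utf8Hex(lone));  // 61-EF-BF-BD-62
    }
}
```

The string itself never complains; only the conversion step notices the unpaired surrogate.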

Now, "visualizers" (modules, functions or methods that have to "show" a string) often have limitations and can't display every character, even perfectly valid ones. A classical example is Zalgo text. A visualizer limitation is what you are running into, although Zalgo itself is a different problem :-)

To give an example, in Windows there are at least 4 "official" APIs for writing text to the screen: GDI, GDI+, Uniscribe and DirectWrite. Many programs (primarily games) use the FreeType library as an alternative. Each of these libraries supports only some parts of Unicode.
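When a visualizer hides a character, one workaround is to escape non-printable characters yourself before looking at the string. This is only a sketch: `VisibleText.Escape` is a hypothetical helper, and it escapes just the characters that `char.IsControl` reports as control characters:

```csharp
using System;
using System.Text;

static class VisibleText
{
    // Hypothetical helper: rewrites each control character as a \uXXXX escape
    // so it shows up in a debugger, a log file or a console.
    public static string Escape(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (char.IsControl(c))
                sb.Append("\\u").Append(((int)c).ToString("X4"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        string x = "a" + (char)133 + "b";
        Console.WriteLine(char.IsControl(x[1]));  // True
        Console.WriteLine(Escape(x));             // a\u0085b
    }
}
```

Evaluating something like `VisibleText.Escape(x)` in a watch window would then show the hidden character explicitly.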

I'll add that the character causing you trouble (0x85) is called NEL, or Next Line. It is a control character, so not something that is meant to be "shown", and it has a complex and funny history that could explain why it is sometimes shown as an ellipsis:

the code for NEL has been used as the ellipsis ('…') character in Windows-1252.

For instance:

  • YAML[8] no longer recognizes NEL, LS and PS as special, in order to be compatible with JSON.

  • ECMAScript[9] accepts LS and PS as line breaks, but considers U+0085 (NEL) white space, not a line break.

  • Microsoft Windows 2000 does not treat any of NEL, LS or PS as a line break in the default text editor, Notepad.

On Linux, a popular editor, gedit, treats LS and PS as newlines but does not for NEL.
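If the sender really meant the Windows-1252 ellipsis, one possible repair is to map U+0085 back to U+2026 after decoding. The sketch below assumes the bytes on the wire are the UTF-8 sequence 0xC2 0x85 that the asker reports in the comments, and that no genuine NEL is ever sent. `Cp1252Repair.FixEllipsis` is a hypothetical helper, not a framework API:

```csharp
using System;
using System.Text;

static class Cp1252Repair
{
    // Hypothetical helper. Assumption: every U+0085 in the input was
    // intended as the Windows-1252 ellipsis (byte 0x85), never as NEL.
    public static string FixEllipsis(string s) =>
        s.Replace('\u0085', '\u2026');

    static void Main()
    {
        // 'a', then UTF-8 for U+0085 (C2 85), then 'b'.
        byte[] wire = { 0x61, 0xC2, 0x85, 0x62 };
        string received = Encoding.UTF8.GetString(wire);
        Console.WriteLine((int)received[1]);   // 133

        string repaired = FixEllipsis(received);
        Console.WriteLine((int)repaired[1]);   // 8230
    }
}
```

Windows-1252 assigns printable characters to most of the 0x80-0x9F range, so a sender with this problem will probably mangle curly quotes and dashes too. A fuller fix would address the encoding mismatch at the source rather than patching single characters.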

xanatos
  • Thanks! I've read up a little more on Unicode, UTF-8, and extended ASCII. It appears that 0x85 is an extended ASCII character for an ellipsis, and the program writing to the Windows message queue is using this. I can see this character encoded as UTF-8 bytes 0xC2 0x85 in the message, and my program reads the data as a C# Unicode string, where the value is Unicode 0x0085. This is, as you state, a Unicode next-line, and explains why the VS debugger doesn't "show" it. The various programs I'm dealing with are using different character representations - I just need to understand and handle that. – xdzgor Jun 19 '15 at 09:33