4

I'm reading that C# stores Unicode characters in char (aka System.Char) variables, which have a fixed length of 16 bits. However, 16 bits are not enough to store all Unicode characters! How, in this case, do C#'s char variables support Unicode?

  • https://en.wikipedia.org/wiki/UTF-16 – SLaks Apr 19 '18 at 14:48
  • Char can support any unicode in the range of `U+0000 to U+FFFF` – maccettura Apr 19 '18 at 14:48
  • It only supports character ranges from 0x0000 to 0xFFFF (which is 65,536 characters). If you want it to display different code pages, you need to set the [code page](https://msdn.microsoft.com/en-us/library/system.text.encoding.codepage(v=vs.110).aspx). – Ron Beyer Apr 19 '18 at 14:48
  • Possible duplicate of [Using unicode characters bigger than 2 bytes with .Net](https://stackoverflow.com/questions/16816528/using-unicode-characters-bigger-than-2-bytes-with-net) – maccettura Apr 19 '18 at 14:49
  • With UTF-16. It is like UTF-8 based on 16-bit Unicode. – i486 Apr 19 '18 at 14:58
  • Perfectly valid question. Read [this answer](https://stackoverflow.com/a/16819696/3150802) which explains the issue. The bottom line is that char had better been 32 bit (and string accordingly a sequence of 32 bit values), but "that train has left the station", as we say in German. – Peter - Reinstate Monica Apr 19 '18 at 15:11
  • Read the appropriate reference or read more closely, [System.Char](https://learn.microsoft.com/en-us/dotnet/api/system.char?view=netframework-4.7.1#remarks). What's UTF-16? is still a valid question, though. Keep reading that reference and it explains, or go to the source [Unicode.org FAQ](https://www.unicode.org/faq/utf_bom.html#utf16-1). – Tom Blodget Apr 22 '18 at 15:35
  • Possible duplicate of [C# and UTF-16 characters](https://stackoverflow.com/questions/697055/c-sharp-and-utf-16-characters) – Tom Blodget Apr 22 '18 at 16:06

1 Answer

6

In short: Surrogates

This is a good question. Unicode is more complicated than most people think, because it introduces several new concepts (character set, code point, encoding, code unit), but I will try to give a reasonably complete answer.

Intro:

Unicode is a character set. A character set is just a list of character/code point pairs. A code point is simply a number that identifies the paired character. UTF-8, UTF-16 and UTF-32 are encodings. Encodings define how the numbers (code points) are represented in binary form (as code units). Code units can consist of one or more bytes. (Actually, the original ASCII code units are even just 7 bits long, but that's another story.)

Remember: character sets are made of code points and encodings are made of code units.
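Here is a small illustration of those concepts in C#; for a character in the 16-bit range, casting a char to int yields its code point ('A' / U+0041 is just an example):

```csharp
// The character 'A' is paired with the code point U+0041 in the Unicode character set.
char a = 'A';
int codePoint = a;                           // implicit widening: char -> int
Console.WriteLine(codePoint.ToString("X4")); // 0041

// The same code point is represented by different code units in each encoding:
// UTF-8:  0x41        (one 8-bit code unit)
// UTF-16: 0x0041      (one 16-bit code unit)
// UTF-32: 0x00000041  (one 32-bit code unit)
```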

The C# char type represents a single UTF-16 code unit. UTF-16 is a variable-length / multibyte encoding for the Unicode character set, meaning a character can be represented by one or two 16-bit code units. Unicode code points beyond the 16-bit range are represented by two UTF-16 code units, which equals four bytes.
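To see this in action (the emoji U+1F600 is just an example of a code point beyond the 16-bit range):

```csharp
// A code point within the 16-bit range (U+00E9, 'é') fits in a single char.
char e = '\u00E9';

// A code point beyond the 16-bit range (U+1F600, 😀) does not fit in one char;
// inside a string it is stored as two chars: a surrogate pair.
string emoji = "\U0001F600";
Console.WriteLine(emoji.Length);                   // 2 (two UTF-16 code units)
Console.WriteLine(char.IsHighSurrogate(emoji[0])); // True
Console.WriteLine(char.IsLowSurrogate(emoji[1]));  // True
Console.WriteLine(char.ConvertToUtf32(emoji, 0));  // 128512 (0x1F600)
```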

Now to answer your question: how?

The original idea of Unicode was 1 character = 1 code point. But the original encoding, UCS-2 (now obsolete), uses two bytes (16 bits) and could only encode 65,536 code points. After a short time this was not enough for the growing Unicode character set. Oh really, what were they thinking? Two bytes are obviously not enough. To fix this problem, Unicode had to step back from the original idea and introduce surrogates.

Therefore UTF-16 was born: a variable-length/multibyte encoding (with 16-bit code units) that implements surrogates. These surrogates are special 16-bit code units whose values correspond to code points that Unicode explicitly defines as not being characters. Finding a surrogate while parsing your text simply means you also have to read the next 16 bits and interpret both 16-bit units (the high surrogate and the subsequent low surrogate) as one combined Unicode code point.
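As a rough sketch of what that combination looks like (DecodeSurrogatePair below is a hypothetical helper written only to show the arithmetic; in practice char.ConvertToUtf32 does this for you):

```csharp
// Combine a surrogate pair into a single Unicode code point.
int DecodeSurrogatePair(char high, char low)
{
    // High surrogates lie in 0xD800-0xDBFF, low surrogates in 0xDC00-0xDFFF.
    if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
        throw new ArgumentException("Not a valid surrogate pair.");

    // Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

// The pair 0xD83D 0xDE00 combines to U+1F600 (😀).
Console.WriteLine(DecodeSurrogatePair('\uD83D', '\uDE00').ToString("X")); // 1F600
```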

UTF-32 is a fixed-length four-byte encoding, which is big enough to avoid space problems and can map every code point to a single code unit, but UTF-32 also has to know about surrogates, since the UTF encodings are based on the Unicode standard and surrogates are part of the definition of the Unicode character set.

UTF-8 is also a variable-length/multibyte encoding, but with another interesting encoding technique. In short: the number of leading one bits in the first byte tells you how many bytes (up to four) have to be combined into one Unicode code point; a leading zero bit marks a single-byte (ASCII-compatible) character, and every continuation byte starts with the bits 10.
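A quick comparison of the three encodings in C# (the sample string is just an example: one character inside the 16-bit range plus one beyond it):

```csharp
using System;
using System.Text;

string s = "A\U0001F600"; // 'A' (U+0041) plus 😀 (U+1F600)

Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 5  (1 + 4 bytes)
Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 6  (2 + 4 bytes, UTF-16)
Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 8  (4 + 4 bytes)
```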

Doomjunky