0

I'm getting confused about C# UTF8 encoding...

Assuming those "facts" are right:

  1. Unicode is the "protocol" which defines each character.
  2. UTF-8 defines the "implementation": how to store those characters.
  3. Unicode defines a character range from 0x0000 to 0x10FFFF (source)

According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens to the other characters, those above 0xFFFF, which are defined in the Unicode protocol.

In contrast to C#, when I use Python to write UTF-8 text, it covers the whole expected range (0x0000 to 0x10FFFF). For example:

u"\U00010000"  #WORKING!!!

which isn't working in C#. What's more, when I write the string u"\U00010000" (a single character) to a text file in Python and then read it from C#, this single-character document becomes 2 characters in C#!

# Python (write):
import codecs
text = u"\U00010000"  # the single character from the example above
with codecs.open("file.txt", "w+", encoding="utf-8") as f:
    f.write(text) # len(text) -> 1

// C# (read):
string text = File.ReadAllText("file.txt", Encoding.UTF8); // How I read this text from the file.
Console.WriteLine(text.Length); // 2

Why? How to fix?

No1Lives4Ever
  • Well for a start, `char` is a 16-bit value, meaning it can only store values up to 0xFFFF. If you want to use characters from above that range, you need a `string` and 2 chars together. – DavidG Sep 01 '17 at 09:55
  • *"when I using Python for writing UTF8 text"* - are you sure it's the same UTF8 as in C#? Check this first; it should be enough to use the same encoding, regardless of language. FYI, [`Encoding.UTF8`](https://msdn.microsoft.com/en-us/library/system.text.encoding.utf8(v=vs.110).aspx) != [`Encoding.Unicode`](https://msdn.microsoft.com/en-us/library/system.text.encoding.unicode(v=vs.110).aspx) (see [here](https://stackoverflow.com/q/643694/1997232)), perhaps you just want to use the latter one? – Sinatr Sep 01 '17 at 10:00
  • @DavidG, what you are saying here is that C# does not support the Unicode protocol, because obviously two letters are not equal to one letter. It's just not the same. – No1Lives4Ever Sep 01 '17 at 10:04
  • @Sinatr, I checked it twice. I am using `Encoding.UTF8`. Added my source to the main thread. – No1Lives4Ever Sep 01 '17 at 10:05
  • Can you read in python files made by C# in UTF8? But not opposite? I blame python. – Sinatr Sep 01 '17 at 10:10
  • @Sinatr, Yes I can. This is because the characters C# allows (0x0000 to 0xFFFF) are a subset of the characters Python allows (0x0000 to 0x10FFFF). According to the Unicode documentation, Python is right. – No1Lives4Ever Sep 01 '17 at 10:14
  • It is not fully correct. Unicode is not the protocol and UTF8 is not the implementation. Unicode is a code table, like ASCII but originally 16-bit. UTF8 is a method to fit Unicode into an 8-bit ASCII extension. You can use Unicode in 16-bit form and never need UTF8. The big problem of Unicode is that it should have been 32-bit from the beginning, not 16-bit. Because the 16-bit limit was reached, there are tricks like UTF16 and other hacks to extend Unicode past 16 bits and keep compatibility. – i486 Sep 01 '17 at 10:15
  • @i486 is completely spot-on. Encoding is complicated to say the least. Many people use the wrong terminology, so they confuse all the concepts related to it (and I myself am no exception, I still haven't got the hang of it completely). Unicode is just an abstraction over "characters" in the most liberal sense of that word. In itself it's got nothing to do with encoding, much less with bits and bytes. UTF-* are encodings, specifically of Unicode. – MarioDS Sep 01 '17 at 10:27
  • I accept all your concepts, but I still don't accept why C# has a special edition of UTF8. It's not the standard! – No1Lives4Ever Sep 01 '17 at 10:30
  • What is the content of `file.txt` in HEX? – i486 Sep 01 '17 at 10:43
  • One `char` in C# can hold values between 0 and 0xFFFF. When you have 0x10000 or higher, then it becomes UTF16 and is represented as 2 `char`-s. – i486 Sep 01 '17 at 10:52
  • @No1Lives4Ever C# does not have a "special edition" of UTF-8, it's just that you expect the `String.Length` function to do something differently than what it actually does. See: https://stackoverflow.com/q/26975736/1313143 – MarioDS Sep 01 '17 at 11:01
  • Possible duplicate of [Why is the length of this string longer than the number of characters in it?](https://stackoverflow.com/questions/26975736/why-is-the-length-of-this-string-longer-than-the-number-of-characters-in-it) – MarioDS Sep 01 '17 at 11:02
  • Possible duplicate of [Python C# - Unicode character is not the same on Python and C#](https://stackoverflow.com/questions/45963954/python-c-sharp-unicode-character-is-not-the-same-on-python-and-c-sharp) – Mark Tolonen Sep 02 '17 at 04:15
  • Try `len(u"\U00010000")` on a Python older than 3.3. It has the same leaky abstraction as Java and C#. – Mark Tolonen Sep 02 '17 at 04:23

3 Answers

5

According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens to the other characters, those above 0xFFFF, which are defined in the Unicode protocol.

Unfortunately, a C#/.NET char does not represent a Unicode character.

A char is a 16-bit value in the range 0x0000 to 0xFFFF which represents one “UTF-16 code unit”. Characters in the ranges U+0000–U+D7FF and U+E000–U+FFFF are represented by the code unit of the same number, so everything's fine there.

The less-often-used other characters, in the range U+010000 to U+10FFFF, are squashed into the remaining space 0xD800–0xDFFF by representing each character as two UTF-16 code units together, so the equivalent of the Python string "\U00010000" is C# "\uD800\uDC00".
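The squashing is plain arithmetic, so here is a minimal sketch of it in Python 3 (the helper name `to_surrogate_pair` is made up for illustration and is not part of any library): subtract 0x10000 to get a 20-bit offset, then put the top 10 bits into a unit starting at 0xD800 and the bottom 10 bits into a unit starting at 0xDC00.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units.

    Hypothetical helper, written only to illustrate the arithmetic.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print("\U00010000".encode("utf-16-be") == b"\xd8\x00\xdc\x00")  # True
```

That matches the "\uD800\uDC00" pair above: the same code point, written as the two code units a char-based string has to hold.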

Why?

The reason for this craziness is that the Windows NT series itself uses UTF-16LE as the native string encoding, so for interoperability convenience .NET chose the same. WinNT chose that encoding—at the time thought of as UCS-2 and without any of the pesky surrogate code unit pairs—because in the early days Unicode only had characters up to U+FFFF, and the thinking was that was going to be all anyone was going to need.

How to fix?

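There isn't really a good fix. Some other languages that were unfortunate enough to have based their string type on UTF-16 code units (Java, JavaScript) are starting to add methods to their strings to do operations on them counting a code point at a time. In .NET the support is thin: there are helpers such as char.IsSurrogatePair, char.ConvertToUtf32 and System.Globalization.StringInfo, and newer versions (.NET Core 3.0 and later) add System.Text.Rune and string.EnumerateRunes(), but the string type itself still indexes and counts by UTF-16 code unit.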

Often you don't actually need to count/find/split/order/etc. strings using proper code point items and indexes. But when you really, really do, in .NET, you're in for a bad time. You end up having to re-implement each normally trivial method by manually walking over each char and checking whether it is part of a two-char surrogate pair, or by converting the string to an array of code point ints and back. This isn't a lot of fun, either way.
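To make the "walk over each char" approach concrete, here is a rough sketch in Python 3 rather than C#, so none of it is .NET API. The function names `utf16_code_units` and `code_points` are invented for this example; the first one just simulates what a .NET string would hold as chars.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text (a stand-in for a .NET char sequence)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def code_points(units):
    """Walk the code units, joining each high/low surrogate pair into one code point."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Reverse of the surrogate arithmetic: two units become one code point.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            result.append(unit)  # BMP character (or a lone surrogate, kept as-is)
            i += 1
    return result


units = utf16_code_units("a\U00010000b")
print(len(units))               # 4 code units
print(len(code_points(units)))  # 3 code points
```

The same loop shape is what you would end up writing by hand in C# for counting, indexing or splitting by code point.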

A more elegant and altogether more practical option is to invent a time machine, so we can send the UTF-8 design back to 1988 and prevent UTF-16 from ever having existed.

bobince
2

Unicode has so-called planes (wiki).

As you can see, C#'s char type only supports the first plane, plane 0, the basic multilingual plane.

I know for a fact that C# uses UTF-16 encoding, so I'm a bit surprised to see that it doesn't support code points beyond the first plane in the char datatype. (haven't run into this issue myself...).

This is an artificial restriction in char's implementation, but one that's understandable. The designers of .NET probably didn't want to tie the abstraction of their own character datatype to the abstraction that Unicode defines, in case that standard would not survive (it already superseded others). This is just my guess of course. It just "uses" UTF-16 for memory representation.

UTF-16 uses a trick to squash code points higher than 0xFFFF into 16 bits, as you can read about here. Technically those code points consist of 2 "characters", the so-called surrogate pair. In that sense it breaks the "one code point = one character" abstraction.
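A small Python 3 sketch (assuming Python 3.5 or later for `bytes.hex()`) shows the three different "lengths" involved for the character from the question. Nothing is C#-specific here; the UTF-16 count is simply the number of 16-bit units, which is the quantity the question's C# snippet printed as 2.

```python
text = "\U00010000"  # one code point from a supplementary plane

utf8_bytes = text.encode("utf-8")       # the bytes that end up in file.txt
utf16_bytes = text.encode("utf-16-le")  # UTF-16 form, without a byte order mark

print(len(text))              # 1 code point (Python 3.3+ counts code points)
print(utf8_bytes.hex())       # f0908080, i.e. 4 bytes on disk
print(len(utf16_bytes) // 2)  # 2 UTF-16 code units
```

So the file is the same in both languages; only the unit being counted differs.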

You can definitely get around this by working with string and maybe arrays of char. If you have more specific problems, you can find plenty of information on StackOverflow and elsewhere about working with all of Unicode's code points in .NET.

MarioDS
  • Any reference to statement regarding plane 0? OP is talking about UTF8, you are talking about UTF16 and some planes. – Sinatr Sep 01 '17 at 10:08
  • @Sinatr it's important to recognize there are planes, because all code points in the supplementary planes are encoded using this surrogate pair trick. C# as a language uses UTF-16 to store all characters and strings in memory. Reading a file as UTF-8 is just an instruction on how to interpret the file, not on how to store the contents of that file in memory, which it still does as UTF-16. A conversion from a UTF-8 encoded file to a UTF-16 encoded memory representation thus takes place. – MarioDS Sep 01 '17 at 10:14
  • @Sinatr it is then observed that C#'s `char` implementation does not hide the fact that a code point in one of the supplementary planes of Unicode is encoded as a *pair*, the *surrogate pair*. – MarioDS Sep 01 '17 at 10:16
  • I am aware of [surrogate pairs](http://csharpindepth.com/Articles/General/Unicode.aspx). You wrote *"C#'s char type only supports the first plane, plane 0"* and I doubt that statement, and I don't see how your answer answers the OP's question. – Sinatr Sep 01 '17 at 10:21
  • @Sinatr it does so because the documentation says so. Not in the same words, but as OP points out, it accepts a range from `U+0000` to `U+FFFF` which is exactly the BMP and nothing more. How does it answer the question? => OP wants to know why his C# says his seemingly one-letter string consists of 2 `char`s. This is why, and it's because UTF-16 encodes that one visual character as 2 characters and hence uses 32 bits. – MarioDS Sep 01 '17 at 10:24
0

I've found that if you simply copy and paste the Unicode symbol into a C# string literal, it displays correctly when the app runs. This character (🔼) is code point 128316 (U+1F53C), and it can be copied into a string from a site such as https://www.amp-what.com/

iStuart