27

Updated question ¹

With regard to character classes, comparison, sorting, normalization and collations, which Unicode version or versions are supported by which .NET platforms?

Original question

I remember somewhat vaguely having read that .NET supported Unicode version 3.0 and that the internal UTF-16 encoding is not really UTF-16 but actually uses UCS-2, which is not the same. It seems, for instance, that characters above U+FFFF are not possible, i.e. consider:

string s = "\u1D7D9"; // ("Mathematical double-struck digit one") 

and it stores the string "ᵽ9".

I'm basically looking for definitive references for, or answers to, the following:

  • If it isn't true UTF-16 in .NET, what is it?
  • What version of Unicode is supported by .NET?
  • If recent versions are not supported or planned in the near future, does anybody know of a (non)commercial library or how I can work around this issue?

¹) I updated the question because, with passing time, the new wording seems more appropriate with respect to the answers and to the larger community. I left the original question in place; parts of it have been answered in the comments. Also, the old UCS-2 (no surrogates) was used in now-ancient 32-bit Windows versions; .NET has always used UTF-16 (with surrogates) internally.

Abel
  • What exactly are you trying to do with those characters? Put them in a webpage with ASP.NET? Display them in a WPF or WinForms interface? – Joe Strommen Feb 06 '12 at 15:15
  • What does "it doesn't seem to work" mean in this context? – Gabe Feb 06 '12 at 15:47
  • @JoeStrommen: we're implementing a new XML-based data transformation toolset, and I'm trying to find out whether I can say "we support Unicode up to 6.0" or whether we should say something else. In addition, I'm trying to find out how we could bypass possible limitations in .NET. – Abel Feb 06 '12 at 15:52
  • @Gabe: I updated my question, hopefully it's clearer now. – Abel Feb 06 '12 at 15:56
  • Oh, it looks like you were just using the wrong escape mechanism in C# -- it has nothing to do with .NET. Your string was interpreted as "\u1D7D" + "9". You just need "\U0001D7D9". – Gabe Feb 06 '12 at 16:01
  • @Gabe: indeed, I wasn't aware of `\U` (never needed it before I guess) and then wrongly concluded that there was no support for higher planes. – Abel Feb 06 '12 at 16:25

4 Answers

19

Internally, .NET is UTF-16. In some cases, e.g. when ASP.NET writes to a response, by default it uses UTF-8. Both of them can handle higher planes.

The reason people sometimes refer to .NET as UCS-2 is (I think, because I see few other reasons) that Char is strictly 16-bit and a single Char can't represent a character in the upper planes. Char does, however, have static method overloads (e.g. Char.IsLetter(String, Int32)) that can operate on high-plane UTF-16 characters inside a string. Strings are stored as true UTF-16.

You can address high Unicode codepoints directly using uppercase \U - e.g. "\U0001D7D9" - but again, only inside strings, not chars.
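As a small illustration of the point about Char versus String, the sketch below builds the string with the `\U` escape and inspects it with the string-based Char overloads. The class name `SurrogateDemo` and the helper `CountCodePoints` are made up for this example; the Char and String members it calls are standard framework methods.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: counts code points, treating each valid
    // surrogate pair as a single character.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // IsSurrogatePair(string, int) checks positions i and i + 1.
            if (char.IsSurrogatePair(s, i)) i++; // skip the low surrogate
            count++;
        }
        return count;
    }

    static void Main()
    {
        string s = "\U0001D7D9"; // MATHEMATICAL DOUBLE-STRUCK DIGIT ONE

        Console.WriteLine(s.Length);                   // 2 UTF-16 code units
        Console.WriteLine(CountCodePoints(s));         // 1 code point
        Console.WriteLine(((int)s[0]).ToString("X"));  // D835 (high surrogate)
        Console.WriteLine(((int)s[1]).ToString("X"));  // DFD9 (low surrogate)
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 1D7D9

        // The (string, int) overloads look at the whole pair, not one half;
        // the result depends on the Unicode data of the running platform.
        Console.WriteLine(char.IsDigit(s, 0));
    }
}
```

A single `char` taken from this string (for example `s[0]`) is only the high surrogate, which is why the string-based overloads are the ones to use for characters outside the Basic Multilingual Plane.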

As for Unicode version, from the MSDN documentation:

"In the .NET Framework 4, sorting, casing, normalization, and Unicode character information is synchronized with Windows 7 and conforms to the Unicode 5.1 standard."

Update 1: It's worth noting, however, that this does not imply that the entirety of Unicode 5.1 is supported, neither in Windows 7 nor in .NET 4.0.

Windows 8 targets Unicode 6.0 - I'm guessing that .NET Framework 4.5 might synchronize with that, but have found no sources confirming it. And once again, that doesn't mean the entire standard is implemented.

Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).

Update 3: Since .NET version 4.5, a new class SortVersion has been available; its FullVersion property (an instance property, reachable for example through CompareInfo.Version) identifies the supported Unicode version. On the same page, Microsoft explains that .NET 4.0 supports Unicode 5.0 on all platforms and that .NET 4.5 supports Unicode 5.0 on Windows 7 and Unicode 6.0 on Windows 8. This differs slightly from the official "what is new" statement here, which talks of version 5.x and 6.0 respectively. From my own (editor: Abel) experience, in most cases it seems that in .NET 4.0, Unicode 5.1 is supported at least for character classes, but I didn't test sorting, normalization and collations. This seems in line with what is said in MSDN as quoted above.
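A minimal sketch of reading the sort version at run time, assuming .NET 4.5 or later. Note that FullVersion is read from a SortVersion instance, obtained here from `CultureInfo.InvariantCulture.CompareInfo.Version`. The class name `SortVersionDemo` and the helper `GetInvariantSortVersion` are invented for this example, and the printed values depend on the framework and the operating system, so none are shown.

```csharp
using System;
using System.Globalization;

static class SortVersionDemo
{
    // Hypothetical helper: returns the SortVersion that the invariant
    // culture's CompareInfo reports on the current platform.
    public static SortVersion GetInvariantSortVersion()
    {
        return CultureInfo.InvariantCulture.CompareInfo.Version;
    }

    static void Main()
    {
        SortVersion version = GetInvariantSortVersion();

        // FullVersion is an int, SortId is a Guid; both vary by platform.
        Console.WriteLine("FullVersion: 0x{0:X}", version.FullVersion);
        Console.WriteLine("SortId:      {0}", version.SortId);
    }
}
```

Comparing two SortVersion values (they implement equality) is one way to detect that stored, sorted data was produced under different collation rules than the ones currently in use.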

Abel
JimmiTh
  • Good observation about `char`. I notice indeed that `char uni = "\U0002B740".ToCharArray()[0];` shows "55405", which is only one half of the UTF-16 surrogate pair. It follows from your reference that trying Char.IsLetter on `\u0526` (incorrectly) shows `false`, because it was only introduced with Unicode 6. – Abel Feb 06 '12 at 16:20
  • (accepting this because you showed the reference I was looking for and was too stupid to find at its obvious location; however, the other answers are valuable in their own right) – Abel Feb 06 '12 at 16:24
  • This might be a helpful point of origin for getting information for single characters: [MSDN link](http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx). Since char cannot contain more than one half, the StringInfo methods return a string instead, with the complete UTF-16 pair (if the character *is* a pair - otherwise it just returns the single char - as a string, or character + combining characters for combining diacritics). – JimmiTh Feb 06 '12 at 16:41
  • This makes much more sense now. The C# Language Spec considers char an unsigned 16-bit **integral type**. So it would seem that it was designed to have a fixed width, which would explain why a single char cannot hold a full UTF-16 surrogate pair. – Nicholas Miller Dec 19 '18 at 15:00
  • "*Since .NET version 4.5 a new class SortVersion is introduced to get the supported Unicode version by calling the static property SortVersion.FullVersion*" -- `SortVersion.FullVersion` isn't static – canton7 Dec 11 '20 at 09:59
  • For the mapping from .NET platforms to unicode standards see https://learn.microsoft.com/en-us/dotnet/api/system.string?redirectedfrom=MSDN&view=net-6.0#strings-and-the-unicode-standard – Varun Mathur Nov 15 '22 at 17:07
5

That character is supported. One thing to note is that for Unicode characters above U+FFFF (which need more than 2 bytes in UTF-16), you must declare them with an uppercase '\U' followed by eight hex digits, like this:

string text = "\U0001D7D9";

If you create a WPF app with that character in a text block, it should render the double-struck one character perfectly.

Joe Strommen
  • One more thing: read http://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx for a description of how >2-byte chars are represented in a string. – Joe Strommen Feb 06 '12 at 15:44
4

MSDN covers it briefly here: http://msdn.microsoft.com/en-us/library/9b1s4yhz(v=vs.90).aspx

I tried this:

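    using System;
    using System.IO;
    using System.Text;

    class Program {
        static void Main(string[] args) {
            // U+1D7D9 lies outside the BMP, so this yields a surrogate pair.
            string someText = char.ConvertFromUtf32(0x1D7D9);
            using (var stream = new MemoryStream()) {
                // Swap Encoding.UTF32 for Encoding.UTF8 or Encoding.Unicode to compare.
                using (var writer = new StreamWriter(stream, Encoding.UTF32)) {
                    writer.Write(someText);
                    writer.Flush();
                }
                // ToArray still works after the writer has closed the stream.
                var bytes = stream.ToArray();
                foreach (var oneByte in bytes) {
                    Console.WriteLine(oneByte.ToString("x"));
                }
            }
        }
    }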

And got a dump of a byte array containing a correct BOM and the correct representation of the U+1D7D9 code point, for these encodings:

  • UTF8
  • UTF32
  • Unicode (UTF-16)

So my guess is that higher planes are supported, and that UTF-16 is really UTF-16 (and not UCS-2)
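The same check can be sketched without streams by calling Encoding.GetBytes directly, which returns the encoded bytes without a BOM. The class name `EncodingDemo` and the helper `ToHex` are invented for this example; the byte values in the comments follow from the UTF-8, UTF-16 (little-endian) and UTF-32 (little-endian) encoding rules for U+1D7D9.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: encodes text and formats the bytes as hex,
    // e.g. "F0-9D-9F-99". GetBytes does not emit a byte order mark.
    public static string ToHex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string text = char.ConvertFromUtf32(0x1D7D9);

        Console.WriteLine(ToHex(Encoding.UTF8, text));    // F0-9D-9F-99
        Console.WriteLine(ToHex(Encoding.Unicode, text)); // 35-D8-D9-DF
        Console.WriteLine(ToHex(Encoding.UTF32, text));   // D9-D7-01-00
    }
}
```

The UTF-16 output shows two 16-bit code units (0xD835 and 0xDFD9, stored little-endian), which is the surrogate pair that UCS-2 would not be able to express.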

Anders Marzi Tornblad
  • Thanks for showing an easy approach. It seems indeed to be UTF-16 and not UCS-2 (anymore?). The character and all its encodings are here: http://www.fileformat.info/info/unicode/char/1d7d9/index.htm – Abel Feb 06 '12 at 16:08
  • Btw, I read that reference but didn't find definitive information about what version was supported of Unicode. – Abel Feb 06 '12 at 16:26
0

  • .NET Framework 4.6, 4.5, 4, 3.5 and 3.0: The Unicode Standard, Version 5.0
  • .NET Framework 2.0 and 1.1: The Unicode Standard, Version 3.1

The complete answers can be found here under the section Remarks.

petra
  • See the edits I made to the original answer; it is not quite what that MSDN page seems to suggest. In fact, that page only talks about the Unicode character categories, which is not the same as character encoding or supported character ranges, and even those differ between versions of the framework and the underlying operating system. For more info, see the [MSDN article on SortVersion](https://msdn.microsoft.com/en-us/library/system.globalization.sortversion%28v=vs.110%29.aspx) (but be warned, even that page is not complete). – Abel May 12 '15 at 23:34