7

Is it possible in C# to use UTF-32 characters not in Plane 0 as a char?

string s = "𝄞"; // valid
char c = '𝄞'; // generates a compiler error ("Too many characters in character literal")

And in s it is represented by two characters, not one.

Edit: I mean, is there a character and string type with full Unicode support, storing each character as UTF-32 or UTF-8? For example, if I want a for loop over the UTF-32 characters (possibly outside Plane 0) in a string.

Dutow

3 Answers

10

The string class represents a UTF-16 encoded block of text, and each char in a string represents a single UTF-16 code unit.

Although there is no BCL type that represents a single Unicode code point, there is support for Unicode characters beyond Plane 0 in the form of method overloads taking a string and an index instead of just a char. For example, the static GetUnicodeCategory(char) method on the System.Globalization.CharUnicodeInfo class has a corresponding GetUnicodeCategory(string,int) method that will recognize a simple character or a surrogate pair starting at the specified index.
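As a rough illustration of the two overloads described above, the sketch below wraps each one in a helper. The class and method names (CategoryDemo, OfCodeUnit, OfCodePoint) are made up for this example; only CharUnicodeInfo.GetUnicodeCategory comes from the framework.

```csharp
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class CategoryDemo
{
    // Uses the char overload: looks at one UTF-16 code unit in isolation.
    public static UnicodeCategory OfCodeUnit(string text, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(text[index]);
    }

    // Uses the (string, int) overload: recognizes a surrogate pair
    // that starts at the given index and classifies the whole code point.
    public static UnicodeCategory OfCodePoint(string text, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(text, index);
    }
}
```

For a string such as "\U0001D7CE" (MATHEMATICAL BOLD DIGIT ZERO, outside Plane 0), OfCodeUnit(s, 0) sees only the high surrogate and reports Surrogate, while OfCodePoint(s, 0) should report DecimalDigitNumber.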


To iterate through the text elements in a string, you can use the methods on the System.Globalization.StringInfo class. Here, a "text element" corresponds to a single character as displayed on screen. This means that simple characters ("a"), combining characters ("a\u0304\u0308" = "ā̈"), and surrogate pairs ("\uD950\uDF21", the code point U+64321) will all be treated as a single text element.

Specifically, the GetTextElementEnumerator static method will allow you to enumerate over each text element in a string (see the linked MSDN page for a code example).
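A minimal sketch of that enumeration might look like the following. The wrapper class and method (TextElementDemo, SplitTextElements) are hypothetical names for this example; StringInfo.GetTextElementEnumerator and TextElementEnumerator.GetTextElement are the framework calls being demonstrated.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class TextElementDemo
{
    // Returns each text element of the input as its own string.
    public static List<string> SplitTextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // A text element may span several chars (combining marks, surrogate pairs).
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }
}
```

Called on "a\u0304\u0308b\U0001D11E", this should yield three elements: the "a" with its two combining marks (three chars), the plain "b" (one char), and the surrogate pair (two chars).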

Emperor XLII
  • 1
    Good presentation of the facts. Note that C# allows you to use `"\U00064321"` (exactly eight hexadecimal digits after the `\U`) which is equivalent to `"\uD950\uDF21"` but easier to "understand" from a Unicode/UTF-32 point of view. This is a code point in [plane 6](https://en.wikipedia.org/wiki/Plane_(Unicode)#Unassigned_planes). – Jeppe Stig Nielsen Oct 26 '15 at 11:49
  • 1
    While this is correct at the time of writing, I'd like to add for future reference that there is now a type which represents a UNICODE Scalar Value: Rune. – Patrick Kelly Jan 18 '20 at 16:25
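For readers on a runtime that includes the Rune type mentioned in the comment above (System.Text.Rune, assumed here to mean .NET Core 3.0 or later), a loop over scalar values could be sketched like this. RuneDemo and ScalarValues are hypothetical names for this example.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class RuneDemo
{
    // Returns the Unicode scalar value of each Rune in the string.
    public static List<int> ScalarValues(string text)
    {
        var values = new List<int>();
        // EnumerateRunes decodes surrogate pairs into a single Rune each.
        foreach (Rune rune in text.EnumerateRunes())
        {
            values.Add(rune.Value);
        }
        return values;
    }
}
```

For "a\U00064321" this gives two values, 0x61 and 0x64321, even though the string's Length is 3.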
4

I only know this problem from Java, but I checked the documentation on char before answering, and the behavior is indeed pretty much the same in .NET/C# and Java.

It seems that a char is indeed defined to be 16 bits wide, so it can't hold anything outside of Plane 0. Only String/string is capable of handling those characters. In a char array, such a character is represented as two surrogate chars.

Joachim Sauer
3

C#'s System.String supports UTF-32 just fine, but you can't iterate through the string as if it were an array of System.Char, or use IEnumerable<char>.

for example:

// iterating through a string NO UTF-32 SUPPORT
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample[i]))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample[i]))
    {
        Console.WriteLine("IsLetter");
    }
}

// iterating through a string WITH UTF-32 SUPPORT
for (int i = 0; i < sample.Length; ++i)
{
    if (Char.IsDigit(sample, i))
    {
        Console.WriteLine("IsDigit");
    }
    else if (Char.IsLetter(sample, i))
    {
        Console.WriteLine("IsLetter");
    }

    // Skip the low surrogate only when a valid surrogate pair starts at i.
    if (Char.IsSurrogatePair(sample, i))
    {
        ++i;
    }
}

Note the subtle difference in the Char.IsDigit and Char.IsLetter calls: the second loop passes the string and an index instead of a single char. Also note that String.Length is always the number of 16-bit chars (UTF-16 code units), not the number of characters in the UTF-32 sense.
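To make the Length point concrete, here is a small sketch that collects the code points of a string so they can be counted separately from its chars. CodePointDemo and CodePoints are hypothetical names for this example; char.IsSurrogatePair and char.ConvertToUtf32 are the framework calls it relies on.

```csharp
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class CodePointDemo
{
    // Returns the code points of the string, one entry per character
    // in the UTF-32 sense.
    public static List<int> CodePoints(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; ++i)
        {
            if (char.IsSurrogatePair(text, i))
            {
                // Combine high and low surrogate into one code point.
                result.Add(char.ConvertToUtf32(text, i));
                ++i; // skip the low surrogate
            }
            else
            {
                // BMP char, or a lone surrogate, which is kept as its raw value.
                result.Add(text[i]);
            }
        }
        return result;
    }
}
```

For "a\U0001D11Eb", Length is 4 but CodePoints returns 3 entries, the middle one being 0x1D11E.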

Off topic, but UTF-32 support is completely unnecessary for an application to handle international languages, unless you have a specific business case for an obscure historical/technical language.

  • What you're talking about is not UTF-32, it's just UTF-16 that happens to contain supplemental characters. In UTF-32, every character is stored as four bytes. .NET strings are always UTF-16. – Alan Moore May 09 '09 at 19:59
  • 1
    Instead of "with UTF-32 support", the example should probably read "with surrogate pair support" or "with support for actual characters, not just 16-bit chunks of I-hope-this-char-is-in-the-BMP". – Triynko Apr 13 '11 at 23:00