
I have, for example, this Unicode string, which consists of the Cyclone and Japanese Castle characters. It is defined in C# on .NET, where CLR strings are encoded as UTF-16:

var value = "🌀🏯";

If you check this, you quickly find that value.Length == 4, because C# strings are UTF-16 encoded and each of these characters takes two UTF-16 code units (a surrogate pair). So I can't just loop over each char and take its UTF-32 decimal value: foreach (var character in value) result = (ulong)character;. This raises the question: how can I get the UTF-32 decimal value for each character in any string?

Cyclone should be 127744 and Japanese Castle should be 127983, but I am looking for a general answer that takes any C# string and always produces the UTF-32 decimal value of each character in it.

I've also looked at Char.ConvertToUtf32, but it seems problematic with a string like this one:

var value = "a🌀c🏯";

This has a length of 6. So, how do I know where a new character begins? For example:

Char.ConvertToUtf32(value, 0)   97  int
Char.ConvertToUtf32(value, 1)   127744  int
Char.ConvertToUtf32(value, 2)   'Char.ConvertToUtf32(value, 2)' threw an exception of type 'System.ArgumentException'   int {System.ArgumentException}
Char.ConvertToUtf32(value, 3)   99  int
Char.ConvertToUtf32(value, 4)   127983  int
Char.ConvertToUtf32(value, 5)   'Char.ConvertToUtf32(value, 5)' threw an exception of type 'System.ArgumentException'   int {System.ArgumentException}

There is also this overload:

public static int ConvertToUtf32(
    char highSurrogate,
    char lowSurrogate
)

But to use this overload I first need to know where the surrogate pairs are. How can I detect them?

Alexandru
  • http://stackoverflow.com/questions/5903113/how-to-retrieve-the-unicode-decimal-representation-of-the-chars-in-a-string-cont – MethodMan Aug 21 '15 at 13:35
  • @MethodMan Thanks, the accepted answer there will work but I was hoping there was a more elegant way of doing it in .NET. – Alexandru Aug 21 '15 at 13:43
  • Sometimes the most elegant way looks a bit complex in terms of code structure. – MethodMan Aug 21 '15 at 13:45
  • @MethodMan That's fine by me. It may help the junior developers learn something new. Thanks for the help! – Alexandru Aug 21 '15 at 14:13

2 Answers

Solution 1

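using System;
using System.Text;
string value = "🌀🏯";
// Encoding.UTF32 is little-endian: 4 bytes per code point.
byte[] rawUtf32AsBytes = Encoding.UTF32.GetBytes(value);
int[] rawUtf32 = new int[rawUtf32AsBytes.Length / 4];
// Reinterpret the bytes as ints (assumes a little-endian machine).
Buffer.BlockCopy(rawUtf32AsBytes, 0, rawUtf32, 0, rawUtf32AsBytes.Length);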

Solution 2

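string value = "🌀🏯";
List<int> rawUtf32list = new List<int>();
for (int i = 0; i < value.Length; i++)
{
    // A high surrogate followed by a low surrogate encodes a single code point.
    if (i + 1 < value.Length && Char.IsSurrogatePair(value[i], value[i + 1]))
    {
        rawUtf32list.Add(Char.ConvertToUtf32(value[i], value[i + 1]));
        i++; // skip the low surrogate
    }
    else
        rawUtf32list.Add((int)value[i]); // BMP character (or an unpaired surrogate)
}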
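
For reference, the two-char Char.ConvertToUtf32 overload used in Solution 2 corresponds to the standard UTF-16 surrogate arithmetic. The following is a minimal sketch of that arithmetic, not the framework's actual source; SurrogateMath and Combine are hypothetical names chosen only for illustration:

```csharp
using System;

static class SurrogateMath
{
    // Hypothetical helper (not a framework API): combines a UTF-16
    // high/low surrogate pair into a single Unicode code point.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        // The high surrogate carries the upper 10 bits and the low surrogate
        // the lower 10 bits of (codePoint - 0x10000).
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For the Cyclone, whose UTF-16 code units are 0xD83C and 0xDF00, this gives 0x10000 + (0x3C << 10) + 0x300 = 0x1F300, which is 127744.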

Update:

Starting with .NET Core 3.0 there is the Rune struct, which represents a Unicode scalar value (in effect, one UTF-32 character):

string value = "a🌀c🏯";
var runes = value.EnumerateRunes();

// writes a:97, 🌀:127744, c:99, 🏯:127983
Console.WriteLine(String.Join(", ", runes.Select(r => $"{r}:{r.Value}")));
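
If only the numeric values are needed, the same enumeration can be projected into an array. This is a small sketch assuming .NET Core 3.0 or later; RuneExample and ToCodePoints are hypothetical names used for illustration:

```csharp
using System;
using System.Linq;

static class RuneExample
{
    // Hypothetical helper: returns the UTF-32 value of every rune in the string.
    // EnumerateRunes substitutes U+FFFD (65533) for unpaired surrogates
    // instead of throwing.
    public static int[] ToCodePoints(string value) =>
        value.EnumerateRunes().Select(r => r.Value).ToArray();
}
```

Because unpaired surrogates come back as the replacement character rather than an exception, this variant may be preferable when the input is not guaranteed to be well-formed UTF-16.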
György Kőszeg

Here is an extension method that illustrates one way to do it. The idea is to loop over the string by index and call char.ConvertToUtf32(string, index) to get the Unicode code point at each position. If the returned value is larger than 0xFFFF, the code point was encoded as a surrogate pair, so you advance the index one extra position to skip the low surrogate.

Extension method:

public static IEnumerable<int> GetUnicodeCodePoints(this string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        int unicodeCodePoint = char.ConvertToUtf32(s, i);
        if (unicodeCodePoint > 0xffff)
        {
            i++;
        }
        yield return unicodeCodePoint;
    }
}

Sample usage:

static void Main(string[] args)
{
    string s = "a🌀c🏯";

    foreach(int unicodeCodePoint in s.GetUnicodeCodePoints())
    {
        Console.WriteLine(unicodeCodePoint);
    }
}
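
Note that char.ConvertToUtf32(string, index) throws an ArgumentException when the index lands on an unpaired surrogate. If the input may be malformed, a lenient variant can test with char.IsSurrogatePair first. This is a sketch only; CodePointExtensions and GetCodePointsLenient are hypothetical names, and yielding the raw code unit for an unpaired surrogate is an assumption about the desired behavior:

```csharp
using System.Collections.Generic;

static class CodePointExtensions
{
    // Hypothetical variant: never throws on malformed UTF-16.
    public static IEnumerable<int> GetCodePointsLenient(this string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                // Valid high/low pair: decode it and skip the low surrogate.
                yield return char.ConvertToUtf32(s, i);
                i++;
            }
            else
            {
                // BMP character, or an unpaired surrogate passed through as-is.
                yield return s[i];
            }
        }
    }
}
```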
sstan