24

How can I return the Unicode Code Point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of surrogate pairs.

By "code point" I mean the actual code point according to Unicode, which is different from a code unit (UTF-8 has 8-bit code units, UTF-16 has 16-bit code units, and UTF-32 has 32-bit code units; in the latter case the value of a code unit is equal to the code point, after taking endianness into account).
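To illustrate the distinction, here is a minimal sketch (the character U+1D161, a musical symbol outside the Basic Multilingual Plane, is used only as an example):

```csharp
string s = "\uD834\uDD61"; // one character, code point U+1D161

Console.WriteLine(s.Length);  // 2 -> two UTF-16 code units
Console.WriteLine((int)s[0]); // 55348 (0xD834, high surrogate)
Console.WriteLine((int)s[1]); // 56673 (0xDD61, low surrogate)
```

So `Length` counts code units, not code points, which is why surrogate pairs need special handling.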

svick
  • 236,525
  • 50
  • 385
  • 514
FSm
  • 2,017
  • 7
  • 29
  • 55
  • 2
    This question is severely misworded. “Returning the ‘Unicode’ of a character” has no meaning, and frankly, is nonsense. Your example makes clear what you actually want, but the title needs to be reworked. Please do so. – tchrist Dec 15 '12 at 17:13
  • 1
    Thanks. I have given you my upvote in appreciation. – tchrist Dec 15 '12 at 17:33

7 Answers

15

The following code writes the codepoints of a string input to the console:

string input = "\uD834\uDD61";

for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
{
    var codepoint = char.ConvertToUtf32(input, i);

    Console.WriteLine("U+{0:X4}", codepoint);
}

Output:

U+1D161

Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.

dtb
  • 213,145
  • 36
  • 401
  • 431
  • 1
    That doesn't convert to UTF-32 but returns the code point as an integer; UTF-32 is an encoding, not an integer. This method naming propagates the same confusion as Microsoft labeling the UTF-16LE encoding as "Unicode" – Esailija Dec 15 '12 at 17:03
  • 2
    @Esailija: I wasn't sure what is more confusing: converting to a Unicode code point using a method named `ConvertToUtf32`, or converting to UTF-32 and treating the result as Unicode code point. In the end that's probably splitting hairs. – dtb Dec 15 '12 at 17:08
  • 1
    you can't treat the result of converting to actual UTF-32 as code point, you need to decode the code points from the encoding, just like you would decode from UTF-16 or UTF-8, except simpler. But I can see why this would be seen nitpicky :P – Esailija Dec 15 '12 at 17:12
12

Easy, since a char in C# is a UTF-16 code unit, and for characters in the Basic Multilingual Plane the code unit value is equal to the code point:

char x = 'A';
Console.WriteLine("U+{0:x4}", (int)x);

To address the comments: a char in C# is a 16-bit number and holds a UTF-16 code unit. Code points above the 16-bit space cannot be represented in a single C# char. Characters in C# are not variable width. A string, however, can have two chars following each other, each being a code unit, which together form a surrogate pair encoding a single code point. If you have a string input with characters above the 16-bit space, you can use Char.IsSurrogatePair and Char.ConvertToUtf32, as suggested in another answer:

string input = ....
for (int i = 0; i < input.Length; i += Char.IsSurrogatePair(input, i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(input, i);
    Console.WriteLine("U+{0:X4}", x);
}
driis
  • 161,458
  • 45
  • 265
  • 341
  • 9
    They are unicode code units, not code points. What about characters that require more than one code unit? – President James K. Polk Dec 15 '12 at 16:37
  • @driis... Same as GregS comment – FSm Dec 15 '12 at 16:39
  • @GregS: Can a `char` actually hold a character that requires more than one code unit? – dtb Dec 15 '12 at 16:41
  • @GregS: Please see updated answer. My solution yields exactly the same result as the other (upvoted) answer, it just doesn't jump through as many hoops to get there. – driis Dec 15 '12 at 16:46
  • 2
    @driis: I didn't downvote you, I was just offering a clarifying point. – President James K. Polk Dec 15 '12 at 17:02
  • @dtb: no. I meant Unicode characters, not Char characters. I hate the whole Unicode terminology as it seems designed to confuse people. I still think this answer has "point" and "unit" swapped. – President James K. Polk Dec 15 '12 at 17:03
  • 2
    @Qaesar lower case a (`'a'`) is `U+0061`, uppercase a (`'A'`) is `U+0041` – Esailija Dec 15 '12 at 17:15
  • @GregS A codepoint is an abstract, logical character, one divorced from its low-level physical layout. 99.99% of programmers want to work only with logical characters, not individual physical constituent components that are laid out differently on different systems. That means that a code unit is the ugly thing you never want to deal with. You only want to deal with code points. – tchrist Dec 15 '12 at 17:18
  • @ All, I just want to know, which of ASCII or code point does the processor consider when it looks up the letters?? I'm really getting confused. Thank you. – FSm Dec 15 '12 at 17:22
  • 2
    Sorry if we are confusing you. The problem is that Unicode encodings are actually a bit complex, even though they might not seem so at first glance. The code in this answer, or the one @dtb posted, will work fine for you. I can recommend http://www.joelonsoftware.com/articles/Unicode.html if you want some more background. – driis Dec 15 '12 at 17:31
  • @driis I have to say sorry because I bothered you. Your kind action is really appreciated. Many thanks. – FSm Dec 15 '12 at 17:48
10

In .NET Core 3.0 or later, you can use the Rune Struct:

// Note that 😉 and 👍 are encoded using surrogate pairs
// but A, B, C and ✋ are not
var runes = "ABC✋😉👍".EnumerateRunes();

foreach (var r in runes)
    Console.Write($"U+{r.Value:X4} ");

// Writes: U+0041 U+0042 U+0043 U+270B U+1F609 U+1F44D
DigitalDan
  • 2,477
  • 2
  • 28
  • 35
4

C# cannot store Unicode code points in a char, as a char is only 2 bytes and Unicode code points routinely exceed that range. The solution is either to represent a code point as a sequence of bytes (either a byte array or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF-32, but that's not always ideal.

This is the code we use to split a string into its Unicode code point components while preserving the native UTF-16 encoding. The result is an enumerable that can be used to compare (sub)strings natively in C#/.NET:

    public class InvalidEncodingException : System.Exception
    { }

    public static class StringExtensions
    {
        public static IEnumerable<string> UnicodeCodepoints(this string s)
        {
            for (int i = 0; i < s.Length; ++i)
            {
                if (Char.IsSurrogate(s[i]))
                {
                    // A surrogate must be followed by its other half
                    if (s.Length < i + 2)
                    {
                        throw new InvalidEncodingException();
                    }
                    yield return string.Format("{0}{1}", s[i], s[++i]);
                }
                else
                {
                    yield return string.Format("{0}", s[i]);
                }
            }
        }
    }
Mahmoud Al-Qudsi
  • 28,357
  • 12
  • 85
  • 125
2

Actually there is some merit in @Yogendra Singh's answer, currently the only one with a negative score. The job can be done like this:

    public static IEnumerable<int> Utf8ToCodePoints(this string s)
    {
        var utf32Bytes = Encoding.UTF32.GetBytes(s);
        var bytesPerCharInUtf32 = 4;
        Debug.Assert(utf32Bytes.Length % bytesPerCharInUtf32 == 0);
        for (int i = 0; i < utf32Bytes.Length; i += bytesPerCharInUtf32)
        {
            yield return BitConverter.ToInt32(utf32Bytes, i);
        }
    }

Tested with

    var surrogatePairInput = "abc💩";
    Debug.Assert(surrogatePairInput.Length == 5);
    var pointsAsString = string.Join(";" , 
        surrogatePairInput
        .Utf8ToCodePoints()
        .Select(p => $"U+{p:X4}"));
    Debug.Assert(pointsAsString == "U+0061;U+0062;U+0063;U+1F4A9");

Example is relevant because the pile of poo is represented as a surrogate pair.
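As a quick sanity check of that claim (not part of the original test code), the emoji does indeed occupy two UTF-16 code units in a .NET string:

```csharp
// U+1F4A9 is above the BMP, so it needs two UTF-16 code units
Console.WriteLine("\U0001F4A9".Length);                   // 2
Console.WriteLine(char.IsSurrogatePair("\U0001F4A9", 0)); // True
```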

Călin Darie
  • 5,937
  • 1
  • 18
  • 13
  • As a point of improvement rather than getting the utf8 bytes and then converting them to utf32 you could just get the utf32 bytes in the first place. – Chris Jun 21 '17 at 15:31
  • Also the reason that the answer you mentioned has a negative score is that the method only accepts a `char` as a parameter which means it could never give you more than two bytes of information. Yours is a vast improvement because you actually parse a string and not a char. – Chris Jun 21 '17 at 15:33
  • Thanks @Chris. I simplified the method. – Călin Darie Jun 22 '17 at 19:32
-1

I found a little method on an MSDN forum. Hope this helps.

    public int get_char_code(char character){ 
        UTF32Encoding encoding = new UTF32Encoding(); 
        byte[] bytes = encoding.GetBytes(character.ToString().ToCharArray()); 
        return BitConverter.ToInt32(bytes, 0); 
    } 
Yogendra Singh
  • 33,927
  • 6
  • 63
  • 73
  • 4
    Does this ever return something different than `(int)character`? What happens if `character` is one half of a surrogate pair? – dtb Dec 15 '12 at 16:49
  • @dtb (very late answer, I know). The interesting thing of this code is that it shows using `UTF32Encoding`, but since the method only takes a `char`, it has no effect and is the same as `(int) character`, though much slower than a cast. In fact, `character.ToString().ToCharArray()` will always return an array of one item (size 2 bytes), and the `BitConverter` will never return a value > 65535. Nice idea in principle, but useless in the way it is presented. – Abel Sep 12 '17 at 18:10
-1

public static string ToCodePointNotation(char c)
{
    return "U+" + ((int)c).ToString("X4");
}

Console.WriteLine(ToCodePointNotation('a')); //U+0061
Esailija
  • 138,174
  • 23
  • 272
  • 326
  • @Qaesar lower case a (`'a'`) is `U+0061`, uppercase a (`'A'`) is `U+0041` – Esailija Dec 15 '12 at 17:16
  • You should throw an exception if `Char.IsSurrogate(c)` because such a code unit cannot be considered a codepoint value and therefore doesn't have a codepoint notation. – Tom Blodget Feb 04 '16 at 17:55
  • 1
    This answer is simply not correct, you cannot presume there exists a one-to-one mapping between a C# `char` and a UTF-16 codepoint because there is none. – Mahmoud Al-Qudsi Apr 07 '17 at 14:15