Converting plain text into its corresponding Unicode value?

Question

I am writing a program that requires me to convert Unicode text into its corresponding Unicode values, the way you would convert the letter 'a' into its number in the ASCII table (97 in decimal). I would like to know whether this can be done for Unicode.

Thanks in advance.

Is this what you want? https://stackoverflow.com/questions/18627694/how-to-insert-a-symbol-pound-euro-copyright-into-a-textbox — Jeremy Thompson, Apr 24 '19 at 00:14
Maybe he needs [`Char.ConvertToUtf32`](https://learn.microsoft.com/en-us/dotnet/api/system.char.converttoutf32?view=netframework-4.7.1)? — Dour High Arch, Apr 24 '19 at 00:21

score 1 · Answer 1 · answered Apr 28 '19 at 15:27

.NET doesn't have a built-in method for iterating letters or character codes in the sense that you ask, since they sit in a middle ground between the character encoding that .NET uses (UTF-16) and graphemes ("user-perceived characters").

UTF-16 encodes each Unicode codepoint in one or two code units (.NET's Char, aliased in C# as char). A String (aliased in C# as string) is a counted sequence of UTF-16 code units.
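To make the "one or two code units" point concrete, here is a minimal sketch that uses only the base class library. The class and method names (`CodeUnitDemo`, `CodeUnitCount`) and the sample codepoints are illustrative assumptions, not part of the question.

```csharp
using System;

static class CodeUnitDemo
{
    // Hypothetical helper: number of UTF-16 code units needed for one codepoint.
    public static int CodeUnitCount(int codepoint)
    {
        // ConvertFromUtf32 returns a string holding one char, or a surrogate pair.
        return char.ConvertFromUtf32(codepoint).Length;
    }

    static void Main()
    {
        // Illustrative codepoints: 'a', the euro sign, and one outside the BMP.
        int[] codepoints = { 0x0061, 0x20AC, 0x1F600 };

        foreach (int cp in codepoints)
        {
            Console.WriteLine($"U+{cp:X4} -> {CodeUnitCount(cp)} code unit(s)");
        }
    }
}
```

The first two codepoints fit in a single `char`; the third is above U+FFFF, so it is stored as a surrogate pair and contributes 2 to `String.Length`.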

The Char struct does have some methods that deal with codepoints (as Int32) and some awkward ones that can help iterate codepoints. Note: codepoints are usually written with a U+ prefix and 4 or 5 hexadecimal digits.
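As a rough sketch of what iterating codepoints with the `Char` methods can look like, the following uses `char.ConvertToUtf32(string, int)` and `char.IsHighSurrogate`. The helper name `CodePoints` and the sample string are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Hypothetical helper: yields each Unicode codepoint in a UTF-16 string.
    public static IEnumerable<int> CodePoints(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // Combines a surrogate pair at i and i + 1 into one codepoint.
            // Throws ArgumentException if the string has an unpaired surrogate.
            int cp = char.ConvertToUtf32(s, i);
            yield return cp;

            // Skip the low surrogate that was just consumed.
            if (char.IsHighSurrogate(s[i]))
            {
                i++;
            }
        }
    }

    static void Main()
    {
        // "a", the euro sign, and U+1F600 written as a \U escape.
        foreach (int cp in CodePoints("a\u20AC\U0001F600"))
        {
            Console.WriteLine($"U+{cp:X4} ({cp})");
        }
    }
}
```

For the sample string this yields three codepoints (97, 8364 and 128512) even though the string contains four `char` values.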

The StringInfo class has some methods that iterate graphemes (aka "text elements").

But, since you ask about Unicode character codes ("codepoints"), the UnicodeInformation NuGet package might be the best option.

With it, you can also get the description of each codepoint, as published by Unicode.org. Their website has a lot of information, including complete lists of codepoints.

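using System;
using System.Globalization; // StringInfo
using System.Unicode;       // UnicodeInfo, from the UnicodeInformation NuGet package

var s = "Put your  repair hobby on your résumé.";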
//  takes two UTF-16 code units. 
// Second é is two codepoints: "e\u0301", base and combining codepoints

var e = StringInfo.GetTextElementEnumerator(s);
while (e.MoveNext())
{
    var grapheme = (String)e.Current;
    Console.WriteLine(grapheme);

    foreach (var codepoint in grapheme.AsCodePointEnumerable())
    {
        var info = UnicodeInfo.GetCharInfo(codepoint);
        Console.WriteLine($"    U+{codepoint:X04} {info.Name} {info.Category}");
    }
}

Also, in case you are not aware, UTF-16 (or its forward-compatible precursor UCS-2) has been the native character encoding in many environments for approximately 25 years: VB4/5/6/A/Script, Java, JavaScript, Windows API, NTFS, SQL NCHAR and NVARCHAR, ….

score 0 · Accepted Answer · answered Apr 24 '19 at 00:17

Try this:

string text = "€ a+…”";
foreach (char c in text)
{
    Console.WriteLine("{0} U+{1:X4} {2}", c, (int)c, (int)c);
}

For each character in the string, this displays:

The character
Its Unicode character code in hex
Its Unicode character code as a decimal number

TedOnTheNet


Nor for NFD or NFKD sequences. But is that the OP's intention? – Mr Lister Jun 04 '19 at 11:53
