
I have a string that contains an odd Unicode space character, but I'm not sure which character it is. I understand that in C# a string in memory is encoded in the UTF-16 format. What is a good way to determine which Unicode characters make up the string?

This question was marked as a possible duplicate of Determine a string's encoding in C#. It is not a duplicate of that question because I'm not asking what the encoding is; I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values in the string are.

user2481095
  • Just to be clear, are you concerned about the characters which might be surrogate pairs? If everything's in the BMP, you can just use `foreach (char c in text) { Console.WriteLine((int) c); }` – Jon Skeet May 11 '16 at 18:37
  • Use Char.IsHighSurrogate and Char.IsLowSurrogate if you're not sure. – glenebob May 11 '16 at 18:42
  • What type of characters are not represented by the BMP? – user2481095 May 11 '16 at 18:51
  • Try [Character Classes in Regular Expressions](https://msdn.microsoft.com/en-us/library/20bw873z%28v=vs.110%29.aspx). Note the _General Categories_ and _Named Blocks_. – Alexander Petrov May 11 '16 at 20:18
  • svar.ToCharArray() in a quick watch expression is a good way, especially when you change the display format to hexadecimal. – Hans Passant May 11 '16 at 20:42
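
Building on the comments above, here is a minimal sketch that prints each UTF-16 code unit as a hex value and flags surrogate halves. The sample string is only an illustration, not from the question:

using System;

class Program
{
    static void Main()
    {
        // Placeholder string: 'a', a ZERO WIDTH SPACE, and U+10330 as a surrogate pair.
        string text = "a\u200B\uD800\uDF30";

        foreach (char c in text)
        {
            // Print each UTF-16 code unit in U+XXXX form, noting surrogate halves.
            string kind = char.IsHighSurrogate(c) ? " (high surrogate)"
                        : char.IsLowSurrogate(c) ? " (low surrogate)"
                        : "";
            Console.WriteLine($"U+{(int)c:X4}{kind}");
        }
    }
}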

2 Answers


Characters in the BMP fit in 2 bytes (values 0x0000-0xFFFF), so there's a good bit of coverage there. Characters from the Chinese, Thai, even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. All the same, a character like the one at http://www.fileformat.info/info/unicode/char/10330/index.htm won't be handled correctly by code that assumes every character fits into two bytes.
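
As a rough illustration (my own sketch, not part of the answer; the method name and sample strings are made up), a quick check for characters outside the BMP might look like this:

using System;
using System.Linq;

class BmpCheck
{
    // True when every char in the string is a plain BMP code unit,
    // i.e. no UTF-16 surrogate halves are present.
    static bool IsBmpOnly(string s) => !s.Any(char.IsSurrogate);

    static void Main()
    {
        Console.WriteLine(IsBmpOnly("Thai: \u0E01"));          // True  - U+0E01 is in the BMP
        Console.WriteLine(IsBmpOnly("Gothic: \uD800\uDF30"));  // False - U+10330 needs a surrogate pair
    }
}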

PamZy
  • The days you can get away with just the BMP have been gone for a while. A popular class of characters outside the BMP are emoji ☹️. – roeland May 12 '16 at 02:24

Unicode identifies characters as numeric code points. Not every code point corresponds to what a reader would perceive as a single character, because Unicode also has the concept of combining characters (which I don't know much about). Still, each Unicode string, even some invalid ones (e.g., an illegal sequence of combining characters), can be thought of as a list of code points (numbers).

In the UTF-16 encoding, each code point is encoded as a 2-byte or 4-byte sequence. In .net, a Char roughly corresponds to either a 2-byte UTF-16 sequence or half of a 4-byte UTF-16 sequence. When a Char contains half of a 4-byte sequence, it is considered a “surrogate” because it only has meaning when combined with the other Char it must be kept with. To get started with inspecting your .net string, you can get .net to tell you the code points contained in the string, automatically combining surrogate pairs where necessary. .net provides Char.ConvertToUtf32, which is described the following way:

Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.

The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:

The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
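
For example (a quick illustrative snippet, not part of the original answer; the sample strings are made up), a well-formed surrogate pair converts cleanly while a mismatched one triggers that exception:

using System;

class ConvertToUtf32Demo
{
    static void Main()
    {
        // A well-formed surrogate pair for U+10330 converts without trouble.
        Console.WriteLine(char.ConvertToUtf32("\uD800\uDF30", 0).ToString("X")); // 10330

        try
        {
            // A high surrogate followed by a non-surrogate is an illegal sequence.
            char.ConvertToUtf32("\uD800a", 0);
        }
        catch (ArgumentException e)
        {
            Console.WriteLine(e.Message);
        }
    }
}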

Thus, you can go character by character through a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance by one Char in your string. If you do encounter a high surrogate, the character requires two Chars and you need to advance by two:

static IEnumerable<int> GetCodePoints(string s)
{
    // Advance by two Chars when the current Char starts a surrogate pair, otherwise by one.
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
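
To connect this back to the question, here is a usage sketch (the sample string is just an assumption) that prints each code point in U+XXXX form, which makes an odd space character easy to identify:

// Example usage of GetCodePoints from above.
var sample = "a\u00A0b"; // contains U+00A0 NO-BREAK SPACE, a common "odd" space character
foreach (var codePoint in GetCodePoints(sample))
{
    Console.WriteLine("U+{0:X4}", codePoint);
}
// Prints: U+0061, U+00A0, U+0062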

When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert those bytes to a .net string before passing them to the above method:

// Encoding.Unicode is .net's name for UTF-16 (little-endian); there is no Encoding.UTF16 property.
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob));

Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Chars with regard to surrogate pairs. Char.ConvertToUtf32() will throw an exception when it encounters such a sequence. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances come from “good” sources, you needn’t worry about Char.ConvertToUtf32() throwing (unless you pass arbitrary values for the index offset, because your offset might land in the middle of a surrogate pair).
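
If you do need to handle strings from untrusted sources, a more defensive variant (my own sketch, not from the answer; the method name is illustrative) can check each position with Char.IsSurrogatePair() and fall back to the raw code unit instead of throwing:

static IEnumerable<int> GetCodePointsLenient(string s)
{
    for (var i = 0; i < s.Length; i++)
    {
        if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsSurrogatePair(s[i], s[i + 1]))
        {
            // Well-formed pair: combine the two halves and skip the low surrogate.
            yield return char.ConvertToUtf32(s[i], s[i + 1]);
            i++;
        }
        else
        {
            // BMP character, or a lone/mismatched surrogate reported as its raw code unit.
            yield return s[i];
        }
    }
}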

binki