C#: read the first char of a string, when that char's unicode value is > 65535

Question

I have a C# method that needs to retrieve the first character of a string, and see if it exists in a HashSet that contains specific unicode characters (all the right-to-left characters).

So I'm doing

var c = str[0];

and then checking the hashset.

The problem is that this code doesn't work for strings where the first char's code point is larger than 65535.

To test this, I wrote a loop that goes through all numbers from 0 to 70,000 (the highest RTL code point is around 68,000, so I rounded up). For each number, I create a byte array and use

Encoding.UTF32.GetString(intValue);

to create a string with this character. I then pass it to the method that searches in the HashSet, and that method fails, because when it gets

str[0]

that value is never what it should be.

What am I doing wrong?

Do you mean get the first `TextElement`? A `char` cannot have a value greater than `65535`. However, a Unicode Character can. — Jodrell, Oct 18 '16 at 16:26
This seems relevant: http://stackoverflow.com/questions/16816528/using-unicode-characters-bigger-than-2-bytes-with-net — RBarryYoung, Oct 18 '16 at 16:33
To summarize (I think) .Net strings are UTF-16, but true unicode requires UTF-32. So a lot of messiness has to happen to adjust to this... — RBarryYoung, Oct 18 '16 at 16:34
To clarify, `UTF-8`, `UTF-16` and `UTF-32` are encodings of Unicode. The confusion arises because `Unicode` is used as an alias for `UTF-16` in the framework. — Jodrell, Oct 18 '16 at 16:37

score 6 · Answer 1 · answered Oct 18 '16 at 16:39

A String is a sequence of UTF-16 code units; one or two code units encode each Unicode codepoint. If you want to get a codepoint from a string, you have to iterate over the codepoints in the string. Also, a "character" as the user perceives it is a base codepoint followed by a sequence of zero or more combining codepoints ("combining characters"), which .NET calls a text element.

// Use a HashSet<String>

var itor = StringInfo.GetTextElementEnumerator(s);
while (itor.MoveNext()) {
    var character = itor.GetTextElement();
    // find character in your HashSet
}

If you don't need to consider combining codepoints, you can strip them out and keep only the base codepoint. (But they are very significant in some languages.)
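For the narrower case of needing only the first text element, a minimal sketch might look like the following. The class and method names (`FirstTextElementDemo`, `FirstTextElement`) are made up for illustration; `StringInfo.GetNextTextElement` is the framework call doing the work.

```csharp
using System;
using System.Globalization;

public static class FirstTextElementDemo
{
    // Returns the first text element of s as a string, or "" if s is null or empty.
    // A text element may span several chars: a surrogate pair, or a base
    // character followed by combining characters.
    public static string FirstTextElement(string s)
    {
        if (string.IsNullOrEmpty(s)) return string.Empty;
        return StringInfo.GetNextTextElement(s, 0);
    }
}
```

With this, `FirstTextElement("\U00010900abc")` yields a two-char string holding the single code point U+10900, and `FirstTextElement("e\u0301x")` yields `e` plus its combining accent. Because the result is a string, the lookup set would be a `HashSet<string>`, as the comment in the answer's snippet suggests.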

Thanks Tom, you're the only one who understood what I meant. Thanks for the solution, I'll try it out tomorrow. — user884248, Oct 18 '16 at 17:43

score 1 · Accepted Answer · answered Oct 25 '16 at 10:03

To anyone who sees this question in the future and is interested in the solution I ended up with - this is my method which decides if a string should be displayed RTL or LTR based on the first character in the string. It takes UTF-16 Surrogate Pairs into account.

Thanks to Tom Blodget who pointed me in the right direction.

if (string.IsNullOrEmpty(str)) return null;

var firstChar = str[0];
if (firstChar >= 0xd800 && firstChar <= 0xdbff)
{
    // if the first character is between 0xD800 - 0xDBFF, it is a high surrogate:
    // the beginning of a UTF-16 surrogate pair. there MUST be one more char after
    // this one, in the range 0xDC00-0xDFFF.
    // for the very unreasonable chance that this is a corrupt UTF-16 string
    // and there is no second character, validate the string length
    if (str.Length == 1) return FlowDirection.LeftToRight;

    // convert surrogate pair to a 32 bit number, and check the codepoint table
    var highSurrogate = firstChar - 0xd800;
    var lowSurrogate = str[1] - 0xdc00;
    var codepoint = (highSurrogate << 10) + (lowSurrogate) + 0x10000;

    return _codePoints.Contains(codepoint)
        ? FlowDirection.RightToLeft
        : FlowDirection.LeftToRight;
}
return _codePoints.Contains(firstChar)
    ? FlowDirection.RightToLeft
    : FlowDirection.LeftToRight;

For anyone finding this answer, please do not use magic numbers to do calculations on Unicode code units. Instead, use methods such as `Char.IsHighSurrogate()`, `Char.IsLowSurrogate()`, and `Char.IsSurrogatePair()` to check for the desired attributes. — AndOne, Jan 19 '23 at 07:16
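Following that suggestion, the same check could be sketched with the framework's surrogate helpers instead of hand-written arithmetic. This is an illustration under assumptions, not the accepted answer's actual code: the `FlowDirection` enum is a stand-in for the WPF type, and `RtlCodePoints` is a hypothetical three-entry sample rather than a full RTL table.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for System.Windows.FlowDirection so the sketch compiles on its own.
public enum FlowDirection { LeftToRight, RightToLeft }

public static class FlowDirectionSketch
{
    // Hypothetical sample table; a real one would hold every RTL code point.
    private static readonly HashSet<int> RtlCodePoints = new HashSet<int>
    {
        0x05D0,  // HEBREW LETTER ALEF
        0x0627,  // ARABIC LETTER ALEF
        0x10900, // PHOENICIAN LETTER ALF (above 0xFFFF, needs a surrogate pair)
    };

    public static FlowDirection? GetFlowDirection(string str)
    {
        if (string.IsNullOrEmpty(str)) return null;

        int codePoint;
        if (char.IsSurrogatePair(str, 0))
        {
            // Valid high + low surrogate: combine them into one code point.
            codePoint = char.ConvertToUtf32(str, 0);
        }
        else if (char.IsSurrogate(str[0]))
        {
            // Unpaired surrogate (corrupt UTF-16): fall back to LTR.
            return FlowDirection.LeftToRight;
        }
        else
        {
            // Ordinary BMP character: the char value is the code point.
            codePoint = str[0];
        }

        return RtlCodePoints.Contains(codePoint)
            ? FlowDirection.RightToLeft
            : FlowDirection.LeftToRight;
    }
}
```

`char.IsSurrogatePair(str, 0)` returns false when the string has only one char, so no separate length check is needed before reading the pair.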

score 0 · Answer 3 · answered Oct 18 '16 at 16:39

I'm not sure I understand your question; a short chunk of code might be useful. When you have a line like 'var c = str[0]', assuming 'str' is a string, then c will be a char, which is a single UTF-16 code unit. Because of this, c will never be greater than 2^16 - 1 (65535). Unicode code points can be larger than that, but when that occurs they are encoded to span multiple 'character' positions. In UTF-16, the 'first' character may occupy one or two 16-bit values.
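To make the one-or-two-values point concrete, here is a small sketch that reports how many chars a given code point occupies in a .NET string. The `SurrogateDemo` class and `Describe` method are hypothetical names chosen for this example; `char.ConvertFromUtf32` is the framework call that builds the string.

```csharp
using System;
using System.Collections.Generic;

public static class SurrogateDemo
{
    // Builds the string for one code point and lists its UTF-16 code units in hex.
    public static string Describe(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        var units = new List<string>();
        foreach (char c in s)
        {
            units.Add(((int)c).ToString("X4"));
        }
        return "Length=" + s.Length + "; units=" + string.Join(" ", units);
    }
}
```

`Describe(0x05D0)` gives `Length=1; units=05D0`, while `Describe(0x10900)` gives `Length=2; units=D802 DD00`. In the second case `str[0]` is the high surrogate 0xD802, which is why a lookup keyed on the real code point never matches.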
