1

I have a C# method that needs to retrieve the first character of a string and check whether it exists in a HashSet that contains specific Unicode characters (all the right-to-left characters).

So I'm doing

var c = str[0];

and then checking the hashset.

The problem is that this code doesn't work for strings where the first char's code point is larger than 65535.

To test this, I wrote a loop that goes through all numbers from 0 to 70,000 (the highest RTL code point is around 68,000, so I rounded up). For each number I create a byte array and use

Encoding.UTF32.GetString(intValue);

to create a string with this character. I then pass it to the method that searches in the HashSet, and that method fails, because when it gets

str[0]

that value is never what it should be.

What am I doing wrong?

RBarryYoung
user884248
  • 2
    Do you mean get the first `TextElement`? A `char` cannot have a value greater than `65535`. However, a Unicode Character can. – Jodrell Oct 18 '16 at 16:26
  • This seems relevant: http://stackoverflow.com/questions/16816528/using-unicode-characters-bigger-than-2-bytes-with-net – RBarryYoung Oct 18 '16 at 16:33
  • To summarize (I think) .Net strings are UTF-16, but true unicode requires UTF-32. So a lot of messiness has to happen to adjust to this... – RBarryYoung Oct 18 '16 at 16:34
  • 1
    To clarify, `UTF-8`, `UTF-16` and `UTF-32` are encodings of Unicode. A confusion arises because `Unicode` is used as an alias for `UTF-16` in the framework. – Jodrell Oct 18 '16 at 16:37

3 Answers

6

A String is a sequence of UTF-16 code units; one or two code units encode each Unicode code point. If you want to get a code point from a string, you have to iterate over the code points in the string, not over its chars. What a reader perceives as a "character" may also be more than one code point: a base code point followed by a sequence of zero or more combining code points ("combining characters").

// Use a HashSet<String>

var itor = StringInfo.GetTextElementEnumerator(s);
while (itor.MoveNext()) {
    var character = itor.GetTextElement();
    // find character in your HashSet
}

If you don't need to consider combining code points, you can strip them and look only at the base code point. (But they are very significant in some languages.)
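When combining code points can be ignored, the lookup reduces to reading the first code point of the string. Below is a minimal sketch of that idea, assuming a `HashSet<int>` of code points rather than the `HashSet<String>` used above. The class name, method name, and the two sample entries (U+05D0 and U+10800, a code point in the Cypriot Syllabary block) are placeholders for illustration, not a real RTL table.

```csharp
using System;
using System.Collections.Generic;

static class FirstCodePoint
{
    // Hypothetical sample set; a real table would list every RTL code point.
    public static readonly HashSet<int> RtlCodePoints = new HashSet<int> { 0x05D0, 0x10800 };

    // Returns the first code point of s, or -1 if s is null, empty,
    // or starts with a surrogate that is not part of a valid pair.
    public static int Get(string s)
    {
        if (string.IsNullOrEmpty(s)) return -1;

        // char.ConvertToUtf32 throws on a lone surrogate, so check first.
        if (char.IsSurrogate(s[0]) && !char.IsSurrogatePair(s, 0)) return -1;

        // Combines a surrogate pair into one value, or returns s[0] unchanged.
        return char.ConvertToUtf32(s, 0);
    }

    public static bool StartsWithRtl(string s)
    {
        return RtlCodePoints.Contains(Get(s));
    }
}
```

For example, `FirstCodePoint.StartsWithRtl(char.ConvertFromUtf32(0x10800) + "abc")` returns `true` with this sample set, while `FirstCodePoint.StartsWithRtl("abc")` returns `false`.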

Tom Blodget
  • Thanks Tom, you're the only one who understood what I meant. Thanks for the solution, I'll try it out tomorrow. – user884248 Oct 18 '16 at 17:43
1

To anyone who sees this question in the future and is interested in the solution I ended up with - this is my method which decides if a string should be displayed RTL or LTR based on the first character in the string. It takes UTF-16 Surrogate Pairs into account.

Thanks to Tom Blodget who pointed me in the right direction.

if (string.IsNullOrEmpty(str)) return null;

var firstChar = str[0];
if (firstChar >= 0xd800 && firstChar <= 0xdbff)
{
    // if the first character is between 0xD800 - 0xDBFF, it is a high surrogate,
    // the beginning of a UTF-16 surrogate pair. there MUST be one more char after
    // this one, in the range 0xDC00-0xDFFF (a low surrogate).
    // for the very unreasonable chance that this is a corrupt UTF-16 string
    // and the second character is missing or is not a low surrogate, bail out
    if (str.Length == 1 || str[1] < 0xdc00 || str[1] > 0xdfff)
        return FlowDirection.LeftToRight;

    // convert surrogate pair to a 32 bit number, and check the codepoint table
    var highSurrogate = firstChar - 0xd800;
    var lowSurrogate = str[1] - 0xdc00;
    var codepoint = (highSurrogate << 10) + (lowSurrogate) + 0x10000;

    return _codePoints.Contains(codepoint)
        ? FlowDirection.RightToLeft
        : FlowDirection.LeftToRight;
}
return _codePoints.Contains(firstChar)
    ? FlowDirection.RightToLeft
    : FlowDirection.LeftToRight;
user884248
  • For anyone finding this answer, please do not use magic numbers to do calculations on Unicode code units. Instead, use methods such as `Char.IsHighSurrogate()`, `Char.IsLowSurrogate()`, and `Char.IsSurrogatePair()` to check for the desired attributes. – AndOne Jan 19 '23 at 07:16
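A rough sketch of what the comment above suggests, using `char.IsSurrogatePair`, `char.IsSurrogate`, and `char.ConvertToUtf32` in place of the hex arithmetic. The `Direction` enum is a stand-in for `FlowDirection` so the example compiles without WPF, and the class name, method name, and two-entry table are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for FlowDirection (assumption: keeps the sketch free of WPF references).
enum Direction { LeftToRight, RightToLeft }

static class DirectionDetector
{
    // Hypothetical sample table; the real one would hold every RTL code point.
    static readonly HashSet<int> CodePoints = new HashSet<int> { 0x05D0, 0x10800 };

    public static Direction? Detect(string str)
    {
        if (string.IsNullOrEmpty(str)) return null;

        int codePoint;
        if (char.IsSurrogatePair(str, 0))
        {
            // Valid high + low surrogate: combine them into one code point.
            codePoint = char.ConvertToUtf32(str, 0);
        }
        else if (char.IsSurrogate(str[0]))
        {
            // Lone or misordered surrogate: corrupt UTF-16, fall back to LTR.
            return Direction.LeftToRight;
        }
        else
        {
            // Ordinary BMP character: the char value is the code point.
            codePoint = str[0];
        }

        return CodePoints.Contains(codePoint)
            ? Direction.RightToLeft
            : Direction.LeftToRight;
    }
}
```

The behavior is meant to match the answer's method: `null` for a null or empty string, left-to-right for malformed input, and a table lookup otherwise.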
0

I'm not sure I understand your question; a short chunk of code might be useful. When you have a line like 'var c = str[0]', assuming 'str' is a string, then c will be a char, which is a single UTF-16 code unit. Because of this, c will never be greater than (2^16 - 1). Unicode code points can be larger than that, but when that occurs they are encoded to span multiple 'character' positions. In the case of UTF-16, the 'first' character may occupy one or two 16-bit values.
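To make this concrete, here is a small sketch that reports how the first code point of a string is stored. The class and method names are made up for illustration, and the method assumes a non-empty, well-formed string (`char.ConvertToUtf32` throws otherwise). The test string can be built with `char.ConvertFromUtf32` instead of a round trip through a byte array.

```csharp
using System;

static class SurrogateDemo
{
    // Describes how the first code point of s is laid out in UTF-16.
    // Assumes s is non-empty and well-formed.
    public static string Describe(string s)
    {
        bool pair = char.IsSurrogatePair(s, 0);     // true if s[0], s[1] form a pair
        int codePoint = char.ConvertToUtf32(s, 0);  // the full code point at index 0
        return $"Length={s.Length}, s[0]=0x{(int)s[0]:X4}, pair={pair}, codePoint=U+{codePoint:X4}";
    }
}
```

Calling `SurrogateDemo.Describe(char.ConvertFromUtf32(0x10800))` yields `Length=2, s[0]=0xD802, pair=True, codePoint=U+10800`: the single code point occupies two chars, and `str[0]` alone is only the high surrogate.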

Dweeberly