9

If I have a string like "123‍‍‍", how can I split it into an array, which would look like ["", "1", "2", "3", "‍‍‍"]? If I use ToCharArray() the first Emoji is split into 2 characters and the second into 7 characters.

Update

The solution now looks like this:

public static List<string> GetCharacters(string text)
{
    char[] ca = text.ToCharArray();
    List<string> characters = new List<string>();
    for (int i = 0; i < ca.Length; i++)
    {
        char c = ca[i];
        // A high surrogate followed by a low surrogate forms one code point.
        // The bounds check guards against an unpaired high surrogate at the end.
        if (char.IsHighSurrogate(c) && i + 1 < ca.Length && char.IsLowSurrogate(ca[i + 1]))
        {
            i++;
            characters.Add(new string(new[] { c, ca[i] }));
        }
        else
            characters.Add(new string(new[] { c }));
    }
    return characters;
}

Please note that, as mentioned in the comments, this doesn't work for the family emoji. It only works for emoji that occupy at most two UTF-16 chars (a single surrogate pair). The output for the example would be: ["", "1", "2", "3", "‍", "‍", "‍", ""]

mjw
  • 2
    `‍+‍+‍+ = ‍‍‍` funny, didn't know that – fubo Feb 14 '17 at 13:36
  • 1
    How did this happen? Emoji is for the text rendering engine. Processing text that contains emoji is roughly equivalent to the joy of processing Chinese text. Or Zalgo, if you want a real challenge :) Recognizing surrogates isn't otherwise rocket science, use Char.IsLowSurrogate(). – Hans Passant Feb 14 '17 at 13:37

2 Answers

5

.NET represents strings as a sequence of UTF-16 code units. Unicode code points outside the Basic Multilingual Plane (BMP) are split into a high and a low surrogate. The lower 10 bits of each surrogate supply half of the 20-bit value obtained by subtracting 0x10000 from the code point.
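The bit arithmetic can be sketched as follows. This is only an illustration: the class name `SurrogateDemo` and the method `Combine` are made up for this example, and the built-in `char.ConvertToUtf32` does the same job.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: combines a high/low surrogate pair into its code point.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        int highBits = high - 0xD800; // lower 10 bits of the high surrogate
        int lowBits = low - 0xDC00;   // lower 10 bits of the low surrogate
        return 0x10000 + (highBits << 10) + lowBits;
    }

    public static void Main()
    {
        // U+1F600 is stored in UTF-16 as the pair D83D DE00.
        int codePoint = Combine('\uD83D', '\uDE00');
        Console.WriteLine(codePoint.ToString("X")); // 1F600

        // The framework helper should agree with the manual calculation.
        Console.WriteLine(char.ConvertToUtf32('\uD83D', '\uDE00') == codePoint); // True
    }
}
```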

There are helpers to detect these surrogates (e.g. Char.IsLowSurrogate).

ToCharArray() does not recombine the two halves, so you need to pair the surrogates yourself.

Richard
2

There is a solution which seems to work for the input you specified:

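// Requires: using System.Collections.Generic; using System.Globalization; using System.Linq;
static string[] SplitIntoTextElements(string input)
{
    // Local iterator that yields one text element (grapheme cluster) at a time.
    IEnumerable<string> Helper()
    {
        for (var en = StringInfo.GetTextElementEnumerator(input); en.MoveNext();)
            yield return en.GetTextElement();
    }
    return Helper().ToArray();
}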

PS: This solution should work on .NET 5 and later; earlier .NET versions contain a bug that prevents correct splitting.
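A quick way to check the behaviour on a given runtime is sketched below. The input is an assumption written with escape sequences (a grinning face, "123", and a four-person family joined by zero-width joiners), not necessarily the exact string from the question, and the names `TextElementDemo` and `Split` are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Collects the text elements reported by StringInfo into a list.
    public static List<string> Split(string input)
    {
        var result = new List<string>();
        TextElementEnumerator en = StringInfo.GetTextElementEnumerator(input);
        while (en.MoveNext())
            result.Add(en.GetTextElement());
        return result;
    }

    public static void Main()
    {
        // Assumed sample: U+1F600, "123", then man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
        string input = "\U0001F600123\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466";
        List<string> parts = Split(input);

        Console.WriteLine(input.Length); // 16 UTF-16 code units
        // Expected to be 5 on .NET 5 and later; earlier runtimes report more elements.
        Console.WriteLine(parts.Count);
    }
}
```

Whatever the runtime, concatenating the returned elements should reproduce the original string, which makes a convenient sanity check.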

Vlad