How can I tell whether a string starts with an emoji, and get the first emoji in the string, without using regex?

Question

I've seen some answers here providing monstrous regular expressions to get emojis from a string. But is there a more algorithmic approach? I mean, operating systems and browsers parse emoji-containing strings somehow, and I doubt it's done with regexes.

Wow - harder than you'd expect to just get the first character... Lots of good info here: https://www.meziantou.net/how-to-correctly-count-the-number-of-characters-of-a-string.htm / https://stackoverflow.com/questions/13894021/return-code-point-of-characters-in-c-sharp. StringInfo seems to hold the key: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.substringbytextelements?view=net-7.0 — JohnLBevan, Jan 17 '23 at 11:43

score 2 · Accepted Answer · answered Jan 17 '23 at 12:18

I've knocked up the below extension method / demo; hopefully that's some help.

Caveat: I don't know much about this area, so please don't treat this as gospel, and make sure you test thoroughly before relying on it.

In fact - the reason the regex answer comes up so often is probably because that's currently the best answer, given the complexity.

using System;
using System.Globalization;

public class Demo
{
    public static void Main() // static entry point, so this compiles outside LINQPad too
    {
        var emojiString = " that's an emoji";
        Console.WriteLine(emojiString);
        Console.WriteLine("First actual char is: [{0}]... As chars are only 16 bits, and  is 32", emojiString[0]);
        Console.WriteLine("First char is an emoticon? {0}", emojiString.IsEmoji(0)); 
        Console.WriteLine("Second char is an emoticon? {0}",emojiString.IsEmoji(1)); 
    }
}

public static class UnicodeCodePointExtensions 
{
    // uses StringInfo from the System.Globalization namespace: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo?view=net-7.0
    public static bool IsEmoji(this string inputString, int index) 
    {
        return (new StringInfo(inputString)).IsEmoji(index);
    }
    public static bool IsEmoji(this StringInfo inputString, int index)
    {
        var firstUnicodeChar = inputString.SubstringByTextElements(index, 1); // gets the char at the given index
        var charCode = Char.ConvertToUtf32(firstUnicodeChar, 0); // gets a numeric value for this char; note: we first get the char by index rather than just passing the index as an additional argument here since if there are additional utf32 chars earlier in the string our index would be offset
        return IsEmoticon(charCode) 
        || IsMiscPictograph(charCode)
        || IsTransport(charCode)
        || IsMiscSymbol(charCode)
        || IsDingbat(charCode)
        || IsVariationSelector(charCode)
        || IsSupplemental(charCode)
        || IsFlag(charCode);
    }
    
    // these range values from https://stackoverflow.com/a/36258684/361842
    private static bool IsEmoticon(int charCode) =>
        0x1F600 <= charCode && charCode <= 0x1F64F;
    private static bool IsMiscPictograph(int charCode) =>
        0x1F300 <= charCode && charCode <= 0x1F5FF;
    private static bool IsTransport(int charCode) =>
        0x1F680 <= charCode && charCode <= 0x1F6FF;
    private static bool IsMiscSymbol(int charCode) =>
        0x2600 <= charCode && charCode <= 0x26FF;
    private static bool IsDingbat(int charCode) =>
        0x2700 <= charCode && charCode <= 0x27BF;
    private static bool IsVariationSelector(int charCode) =>
        0xFE00 <= charCode && charCode <= 0xFE0F;
    private static bool IsSupplemental(int charCode) =>
        0x1F900 <= charCode && charCode <= 0x1F9FF;
    private static bool IsFlag(int charCode) =>
        0x1F1E6 <= charCode && charCode <= 0x1F1FF;
}

The unicode scalar ranges used in the private methods can be found here: https://stackoverflow.com/a/36258684/361842

Info on how to get the Nth "character" from a string where not all characters are "char"s here: https://www.meziantou.net/how-to-correctly-count-the-number-of-characters-of-a-string.htm
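To make the difference between "chars" and "characters" concrete, here is a minimal sketch comparing the three ways of counting. The class and method names (`TextElementDemo`, `CountChars`, `CountRunes`, `CountTextElements`, `TextElementAt`) are made up for illustration, and `EnumerateRunes` assumes .NET Core 3.0 or later.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class TextElementDemo
{
    // Number of UTF-16 code units (what string.Length reports).
    public static int CountChars(string s) => s.Length;

    // Number of Unicode scalar values (a surrogate pair counts once).
    public static int CountRunes(string s) => s.EnumerateRunes().Count();

    // Number of text elements as seen by StringInfo.
    public static int CountTextElements(string s) =>
        new StringInfo(s).LengthInTextElements;

    // The text element at the given text-element index.
    // Throws ArgumentOutOfRangeException when the index is out of range.
    public static string TextElementAt(string s, int index) =>
        new StringInfo(s).SubstringByTextElements(index, 1);
}
```

For the string `"\U0001F600 hi"` (one emoji from the supplementary plane followed by " hi"), `CountChars` gives 5, while `CountRunes` and `CountTextElements` both give 4, and `TextElementAt(s, 1)` is the space rather than the second half of the emoji.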

Related MS documentation on the StringInfo class / SubstringByTextElements method here: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.substringbytextelements?view=net-7.0

WENDYN · Answer 2 · 2023-01-29T11:48:27.070

Well, that's simple: the Unicode Emoticons block is U+1F600..U+1F64F.

C# uses UTF-16 strings, so you just convert that Unicode range to actual UTF-16 characters.

I couldn't find a calculator, so I guess we are going to do this by hand.

Assume you have the character U = 0x1F600.

For code points in the range U+010000 to U+10FFFF we use this formula:

U_ = 0byyyyyyyyyyxxxxxxxxxx  // U - 0x10000
W1 = 0b110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
W2 = 0b110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx

U_ will be 0xF600 (0b1111011000000000), which makes the first word W1 = 0b1101100000111101 (0xD83D) and the second word W2 = 0b1101111000000000 (0xDE00).

We do the same thing with U+1F64F and get U_ = 0b1111011001001111, W1 = 0b1101100000111101 (0xD83D), W2 = 0b1101111001001111 (0xDE4F).

You can notice that both first words are the same.
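The hand calculation above can be sketched in code as follows. `SurrogateMath` and `ToSurrogates` are hypothetical names for illustration; the built-in `char.ConvertFromUtf32` performs the same conversion and can be used to double-check the result.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SurrogateMath
{
    // Splits a code point in U+10000..U+10FFFF into its UTF-16 surrogate pair.
    public static (char W1, char W2) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int u = codePoint - 0x10000;             // 20-bit value: yyyyyyyyyyxxxxxxxxxx
        char w1 = (char)(0xD800 + (u >> 10));    // high surrogate from the top 10 bits
        char w2 = (char)(0xDC00 + (u & 0x3FF));  // low surrogate from the bottom 10 bits
        return (w1, w2);
    }
}
```

`SurrogateMath.ToSurrogates(0x1F600)` yields (0xD83D, 0xDE00) and `SurrogateMath.ToSurrogates(0x1F64F)` yields (0xD83D, 0xDE4F), matching the values worked out by hand.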

Assume you have some string s. Check each character of that string against our W1, and if it matches, check whether the next character is in the range 0xDE00 to 0xDE4F.

const int emojiW1 = 0xD83D;
const int emojiW2Start = 0xDE00;
const int emojiW2End = 0xDE4F;

string s = Console.ReadLine();
for(int i = 0; i < s.Length - 1; i++) // `s.Length - 1` because emoji takes two characters
{
    if(s[i] == emojiW1 && s[i + 1] >= emojiW2Start && s[i + 1] <= emojiW2End)
    {
        string emoji = s[i].ToString() + s[i + 1].ToString();
        Console.Write(emoji);
        ++i; //let's skip s[i+1]
    }
}

Notice this doesn't include emoji modifiers (like those for skin colors), but it would be the same process of finding the range and then checking whether the characters are in that range.
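As a rough sketch of extending the same idea to more than one range, the version below lets `char.IsSurrogatePair` and `char.ConvertToUtf32` do the surrogate arithmetic and compares whole code points instead. The names `EmojiRangeScan` and `FindCodePoints` are made up for illustration, and the range table is an assumption that is far from a complete emoji list: it holds only the Emoticons block used above plus the skin tone modifiers U+1F3FB..U+1F3FF. Both ranges sit above U+FFFF, so the loop only looks at surrogate pairs; ranges below U+10000 would need separate handling.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class EmojiRangeScan
{
    // Assumed ranges for illustration; not a complete emoji list.
    private static readonly (int Start, int End)[] Ranges =
    {
        (0x1F600, 0x1F64F), // Emoticons block (the range used above)
        (0x1F3FB, 0x1F3FF), // skin tone modifiers
    };

    // Returns every code point in the string that falls in one of the ranges.
    public static List<int> FindCodePoints(string s)
    {
        var found = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            if (!char.IsSurrogatePair(s, i)) continue; // no valid pair starts at i

            int codePoint = char.ConvertToUtf32(s, i);
            foreach (var (start, end) in Ranges)
            {
                if (codePoint >= start && codePoint <= end)
                {
                    found.Add(codePoint);
                    break;
                }
            }
            i++; // skip the low surrogate of the pair just read
        }
        return found;
    }
}
```

For example, `EmojiRangeScan.FindCodePoints("a\U0001F600\U0001F3FDb")` returns the two code points 0x1F600 and 0x1F3FD, and a plain ASCII string returns an empty list.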

score 1 · Answer 3 · answered Jan 17 '23 at 12:54

Lazy variant, seems to work for my purpose. Thanks @JohnLBevan

using System.Globalization; // StringInfo
using System.Linq;          // First()
using System.Text;          // Rune

string FirstEmoji(string s)
{
    var e = StringInfo.GetNextTextElement(s);
    var r = e.EnumerateRunes().First();
    return Rune.IsSymbol(r) ? e : null;
}

var s1 = FirstEmoji("‍‍‍ family");  //‍‍‍
var s2 = FirstEmoji("family ‍‍‍");  //null
