I've seen some answers here providing monstrous regular expressions to get emojis from a string. But is there a more algorithmic approach? I mean, operating systems and browsers parse emoji-containing strings somehow; I doubt it's done with regexes.
Wow - harder than you'd expect to just get the first character... Lots of good info here: https://www.meziantou.net/how-to-correctly-count-the-number-of-characters-of-a-string.htm / https://stackoverflow.com/questions/13894021/return-code-point-of-characters-in-c-sharp. StringInfo seems to hold the key: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.substringbytextelements?view=net-7.0 – JohnLBevan Jan 17 '23 at 11:43
3 Answers
I've knocked up the below extension method / demo; hopefully that's some help.
Caveat: I don't know much about this area; so please don't treat this as gospel; and ensure you test thoroughly before relying on it.
In fact - the reason the regex answer comes up so often is probably because that's currently the best answer, given the complexity.
using System;
using System.Globalization;
public class Demo
{
void Main()
{
var emojiString = " that's an emoji";
Console.WriteLine(emojiString);
Console.WriteLine("First actual char is: [{0}]... As chars are only 16 bits, and the emoji is 32", emojiString[0]);
Console.WriteLine("First char is an emoticon? {0}", emojiString.IsEmoji(0));
Console.WriteLine("Second char is an emoticon? {0}",emojiString.IsEmoji(1));
}
}
public static class UnicodeCodePointExtensions
{
// uses StringInfo from the System.Globalization namespace: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo?view=net-7.0
public static bool IsEmoji(this string inputString, int index)
{
return (new StringInfo(inputString)).IsEmoji(index);
}
public static bool IsEmoji(this StringInfo inputString, int index)
{
var firstUnicodeChar = inputString.SubstringByTextElements(index, 1); // gets the char at the given index
var charCode = Char.ConvertToUtf32(firstUnicodeChar, 0); // gets a numeric value for this char; note: we first get the char by index rather than just passing the index as an additional argument here since if there are additional utf32 chars earlier in the string our index would be offset
return IsEmoticon(charCode)
|| IsMiscPictograph(charCode)
|| IsTransport(charCode)
|| IsMiscSymbol(charCode)
|| IsDingbat(charCode)
|| IsVariationSelector(charCode)
|| IsSupplemental(charCode)
|| IsFlag(charCode);
}
// these range values from https://stackoverflow.com/a/36258684/361842
private static bool IsEmoticon(int charCode) =>
0x1F600 <= charCode && charCode <= 0x1F64F;
private static bool IsMiscPictograph(int charCode) =>
0x1F300 <= charCode && charCode <= 0x1F5FF;
private static bool IsTransport(int charCode) =>
0x1F680 <= charCode && charCode <= 0x1F6FF;
private static bool IsMiscSymbol(int charCode) =>
0x2600 <= charCode && charCode <= 0x26FF;
private static bool IsDingbat(int charCode) =>
0x2700 <= charCode && charCode <= 0x27BF;
private static bool IsVariationSelector(int charCode) =>
0xFE00 <= charCode && charCode <= 0xFE0F;
private static bool IsSupplemental(int charCode) =>
0x1F900 <= charCode && charCode <= 0x1F9FF;
private static bool IsFlag(int charCode) =>
0x1F1E6 <= charCode && charCode <= 0x1F1FF;
}
The unicode scalar ranges used in the private methods can be found here: https://stackoverflow.com/a/36258684/361842
Info on how to get the Nth "character" from a string where not all characters are "char"s here: https://www.meziantou.net/how-to-correctly-count-the-number-of-characters-of-a-string.htm
Related MS documentation on the StringInfo class / SubstringByTextElements method here: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.substringbytextelements?view=net-7.0
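To show why the index has to be a text-element index rather than a char index, here's a minimal sketch comparing three ways of counting the same string. The sample string and variable names are my own picks for illustration (U+1F600 is an arbitrary emoji, written as an escape so the file encoding doesn't matter), and it assumes .NET Core 3.0 or later for EnumerateRunes.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Arbitrary sample: one emoji (U+1F600) followed by " ok".
string sample = "\U0001F600 ok";

// string.Length counts 16-bit UTF-16 code units, so the emoji counts twice.
int utf16Units = sample.Length;

// EnumerateRunes walks Unicode scalar values (whole code points).
int codePoints = sample.EnumerateRunes().Count();

// StringInfo counts text elements, i.e. what a user perceives as one character.
int textElements = new StringInfo(sample).LengthInTextElements;

Console.WriteLine($"{utf16Units} / {codePoints} / {textElements}"); // 5 / 4 / 4
```

The gap between the first number and the other two is the offset problem mentioned in the code comment above: every earlier surrogate pair shifts the char index by one.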

Well, that's simple: the Unicode Emoticons block is U+1F600..U+1F64F.
C# uses UTF-16 strings, so you just convert the unicode range to actual UTF-16 characters
I couldn't find a calculator so I guess we are going to do this by hand
Assuming you have character U = 0x1F600
We use this formula for the range U+010000 to U+10FFFF:
U_ = 0byyyyyyyyyyxxxxxxxxxx // U - 0x10000
W1 = 0b110110yyyyyyyyyy // 0xD800 + yyyyyyyyyy
W2 = 0b110111xxxxxxxxxx // 0xDC00 + xxxxxxxxxx
U_ will be 0xF600 (0b00001111011000000000 once padded to 20 bits), which makes the first word W1 = 0b1101100000111101 (0xD83D) and the second word W2 = 0b1101111000000000 (0xDE00).
We do the same thing with U+1F64F and get U_ = 0b1111011001001111, W1 = 0b1101100000111101 (0xD83D), W2 = 0b1101111001001111 (0xDE4F).
You can notice that both first words are the same.
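If you'd rather not trust hand arithmetic, the formula can be sketched in a few lines and cross-checked against the framework. ToSurrogates is a hypothetical helper name of mine, not a .NET API; char.ConvertFromUtf32 is the built-in that does the same job.

```csharp
using System;

// Hypothetical helper (not a framework API): splits a supplementary code point
// (U+10000..U+10FFFF) into its UTF-16 surrogate pair using the formula above.
(char W1, char W2) ToSurrogates(int codePoint)
{
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        throw new ArgumentOutOfRangeException(nameof(codePoint));

    int offset = codePoint - 0x10000;                // the 20-bit value U_
    char high = (char)(0xD800 + (offset >> 10));     // top 10 bits (yyyyyyyyyy)
    char low = (char)(0xDC00 + (offset & 0x3FF));    // low 10 bits (xxxxxxxxxx)
    return (high, low);
}

var first = ToSurrogates(0x1F600);
Console.WriteLine($"{(int)first.W1:X4} {(int)first.W2:X4}"); // D83D DE00

// Cross-check the end of the range against the framework's own conversion.
string builtIn = char.ConvertFromUtf32(0x1F64F);
Console.WriteLine($"{(int)builtIn[0]:X4} {(int)builtIn[1]:X4}"); // D83D DE4F
```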
Assuming you have some string s, check every character of that string against our W1 and, if it matches, check that the next character is in the range 0xDE00 to 0xDE4F.
const int emojiW1 = 0xD83D;
const int emojiW2Start = 0xDE00;
const int emojiW2End = 0xDE4F;
string s = Console.ReadLine();
for(int i = 0; i < s.Length - 1; i++) // `s.Length - 1` because emoji takes two characters
{
if(s[i] == emojiW1 && s[i + 1] >= emojiW2Start && s[i + 1] <= emojiW2End)
{
string emoji = s[i].ToString() + s[i + 1].ToString();
Console.Write(emoji);
++i; //let's skip s[i+1]
}
}
Notice this doesn't include emoji modifiers (like those for skin colors), but it would be the same process of finding the range and then checking whether the characters are in that range.
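As a variation on the loop above: on .NET Core 3.0 or later, System.Text.Rune lets you compare whole code points, so the surrogate arithmetic isn't needed at all. This is only a sketch; IsEmoticonOrModifier and FindEmoticons are hypothetical names of mine, and I've added the skin tone modifier range U+1F3FB..U+1F3FF as an example of "another range".

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: true for the Emoticons block or a skin tone modifier.
bool IsEmoticonOrModifier(Rune r) =>
    (r.Value >= 0x1F600 && r.Value <= 0x1F64F)      // Emoticons block, as above
    || (r.Value >= 0x1F3FB && r.Value <= 0x1F3FF);  // skin tone modifiers (my addition)

// Hypothetical helper: collects every matching code point as its own string.
List<string> FindEmoticons(string s)
{
    var found = new List<string>();
    foreach (Rune r in s.EnumerateRunes())   // iterates code points, not 16-bit chars
    {
        if (IsEmoticonOrModifier(r))
            found.Add(r.ToString());
    }
    return found;
}

List<string> hits = FindEmoticons("a\U0001F600b\U0001F64Fc");
Console.WriteLine(hits.Count); // 2
```

Note this still treats a modifier as a separate hit rather than gluing it to the emoji before it; grouping them is what the text-element APIs are for.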

Lazy variant, seems to work for my purpose. Thanks @JohnLBevan
// needs: using System.Globalization; using System.Linq; using System.Text;
string FirstEmoji(string s)
{
var e = StringInfo.GetNextTextElement(s);
var r = e.EnumerateRunes().First();
return Rune.IsSymbol(r) ? e : null;
}
var s1 = FirstEmoji(" family"); //
var s2 = FirstEmoji("family "); //null
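One thing to watch with the lazy variant: Rune.IsSymbol is true for any symbol category, so things like "+" pass too. Here's a small sketch that applies the same check to every text element instead of just the first, which makes that visible. SymbolElements is a hypothetical name of mine, and the sample string is arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper: yields each text element whose first rune is a symbol.
IEnumerable<string> SymbolElements(string s)
{
    TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(s);
    while (elements.MoveNext())
    {
        string element = elements.GetTextElement();
        // Same test as the lazy variant: only the first rune of the element is checked.
        if (Rune.IsSymbol(element.EnumerateRunes().First()))
            yield return element;
    }
}

// The emoji (U+1F600) and the plus sign both pass; the letters do not.
Console.WriteLine(string.Join(" | ", SymbolElements("a\U0001F600+b")));
```

So if plain math or currency symbols can show up in your input, you'd probably want to combine this with a range check like the ones in the other answers.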
