4

I would like a regex to match emoji characters in C#. If it matters, it's the characters from the Windows 8 touch keyboard ie.

Jippers
  • Apparently, the Unicode standard is working on adding some new regex properties to support this: https://stackoverflow.com/a/70936276/54323 – brianary Apr 20 '23 at 02:45

4 Answers

4

There seems to be an Emoji-to-Unicode standard:

https://en.wikipedia.org/wiki/Emoji#In_Unicode

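So you can probably match each of the Unicode ranges. Note that .NET's \uXXXX escape takes exactly four hex digits, so a range such as U+1F300 to U+1F5FF cannot be written as [\u1F30-\u1F5F] (that matches U+1F30 to U+1F5F, which is in the Greek Extended block). Code points above U+FFFF have to be matched as UTF-16 surrogate pairs instead, for example \uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF] for U+1F300 to U+1F5FF.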
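A minimal sketch of matching the U+1F300 to U+1F5FF range in .NET, assuming the code points are written as UTF-16 surrogate pairs because `\uXXXX` only accepts four hex digits. The class and method names (`EmojiRangeDemo`, `ContainsPictograph`) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmojiRangeDemo
{
    // U+1F300..U+1F3FF is the high surrogate D83C followed by DF00..DFFF;
    // U+1F400..U+1F5FF is the high surrogate D83D followed by DC00..DDFF.
    private static readonly Regex Pictographs =
        new Regex(@"\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF]");

    // Returns true if the text contains at least one code point in U+1F300..U+1F5FF.
    public static bool ContainsPictograph(string text)
    {
        return Pictographs.IsMatch(text);
    }
}
```

This covers only the one range used as the example above. Other emoji ranges (for instance the Emoticons block starting at U+1F600) would each need their own surrogate-pair alternative.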

matthew-e-brown
Ilya Kogan
  • Does regex support 5 digit unicode characters? I'm using Expresso regex tester and it doesn't understand that these are 5 digits long. – Jippers Jan 25 '13 at 18:18
  • 1
    Maybe this will help: http://stackoverflow.com/questions/364009/c-sharp-regular-expressions-with-uxxxxxxxx-characters-in-the-pattern – Ilya Kogan Jan 25 '13 at 20:10
  • I guess it's not possible then. Those articles are dated 2008 but say that it's basically not possible to go beyond \uFFFF. – Jippers Jan 25 '13 at 20:45
  • 1
    I was trying to match ✅ and saw this question, but the answers didn't solve my problem. Finally I used this regex pattern: `\p{So}`. – MohaMad Mar 27 '20 at 01:02
  • 1
    @MohaMad Why don't you post it as an answer – Ilya Kogan Nov 26 '20 at 10:22
  • You're right @IlyaKogan, I've posted it as an answer now; I hope it helps other developers. – MohaMad Dec 09 '20 at 20:58
2

\p{So}|\p{Cs}\p{Cs}(\p{Cf}\p{Cs}\p{Cs})* matches all the emojis I've tried, and only those.

StringInfo was useful to make the pattern and might be usable directly instead of regex in some cases.
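As a rough illustration of using StringInfo directly, the sketch below walks a string by text element rather than by `char`, so a surrogate pair is returned as one piece. The names `TextElementDemo` and `SplitTextElements` are hypothetical; only `StringInfo.GetTextElementEnumerator` and `TextElementEnumerator` come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Splits a string into text elements so that a surrogate pair is never cut in half.
    public static List<string> SplitTextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }
}
```

How longer sequences (for example, emoji joined by a format character) are grouped may depend on the .NET version, so it is worth checking on the runtime you target.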

The pattern uses Unicode categories, as shown in @MohaMad's answer. Here it is again, with comments:

@"(?x)           # Enable free-spacing-mode (could have used RegexOptions instead)
\p{So}           # Match OtherSymbol, like ⏸ and ✅
|\p{Cs}\p{Cs}    # OR two Surrogate
 \uD83C\p{Cs}    # with color-modifier, like  and 
                 # (Hacky special case of Multibyte Character Set? It works.)
|\p{Cs}\p{Cs}    # OR two Surrogate, like  and 
 (\p{Cf}         # followed by a Format
 \p{Cs}\p{Cs})   # and two Surrogate, like ‍ and ‍.
*                # zero or more times (I've only seen none or once.)"
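A small sketch of applying the one-line form of the pattern, assuming it is compiled once and used to collect matches. `EmojiPatternDemo` and `FindEmojiLike` are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmojiPatternDemo
{
    // OtherSymbol, OR a surrogate pair optionally followed by
    // (Format + surrogate pair) groups, as in the pattern above.
    private static readonly Regex EmojiLike =
        new Regex(@"\p{So}|\p{Cs}\p{Cs}(\p{Cf}\p{Cs}\p{Cs})*");

    // Returns every substring the pattern matches, in order of appearance.
    public static List<string> FindEmojiLike(string text)
    {
        var found = new List<string>();
        foreach (Match match in EmojiLike.Matches(text))
        {
            found.Add(match.Value);
        }
        return found;
    }
}
```

Because `\p{Cs}\p{Cs}` accepts any surrogate pair, this also matches non-emoji characters outside the Basic Multilingual Plane, so treat it as a heuristic rather than an exact emoji test.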
Grastveit
1

I used Unicode General Categories and Named Blocks for this problem and described it in a short comment below the accepted answer:

I was trying to match ✅ and saw this question, but the answers didn't solve my problem. Finally I used this regex pattern: \p{So}

For more information about named blocks and Unicode general categories, see Microsoft's regular expression documentation.

You can use named blocks such as \p{IsBasicLatin}, \p{IsLatinExtended-A}, \p{IsArabic}, \p{IsCyrillic}, and so on. The S family of general categories also allows more specific symbol matching, such as \p{Sc} for currency symbols or \p{Sm} for math symbols.
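To make the category and named-block syntax concrete, here is a hedged sketch with a few single-character checks. The helper names (`CategoryDemo`, `IsCurrencySymbol`, `IsMathSymbol`, `IsCyrillicLetter`) are invented for this example; the `\p{...}` escapes are standard .NET regex syntax.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CategoryDemo
{
    // \p{Sc} is the general category "Symbol, Currency".
    public static bool IsCurrencySymbol(string s)
    {
        return Regex.IsMatch(s, @"^\p{Sc}$");
    }

    // \p{Sm} is the general category "Symbol, Math".
    public static bool IsMathSymbol(string s)
    {
        return Regex.IsMatch(s, @"^\p{Sm}$");
    }

    // \p{IsCyrillic} is a named block covering U+0400..U+04FF.
    public static bool IsCyrillicLetter(string s)
    {
        return Regex.IsMatch(s, @"^\p{IsCyrillic}$");
    }
}
```

Named blocks only cover the Basic Multilingual Plane ranges that .NET defines, so they do not help directly with emoji above U+FFFF.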

MohaMad
  • 2
    This is the correct way, except that I couldn't match emojis using `\p{So}` (which detects symbols) but rather `\p{Cs}` (which detects surrogate characters) – Cobus Kruger Aug 16 '22 at 15:34
  • `\p{Cs}` will match anything in the [Supplementary Multilingual Plane](https://en.wikipedia.org/wiki/Plane_%28Unicode%29#Supplementary_Multilingual_Plane), which will include a lot of non-English text. `\p{So}` only matches 58/1179 of the `Basic_Emoji` defined by [emoji-sequences.txt](https://www.unicode.org/Public/emoji/15.0/emoji-sequences.txt). – brianary Apr 20 '23 at 02:41
0

You should be able to plug in the Unicode code values to represent them:

Regex regEx = new Regex(@"\uXXXX\uYYYY");

Where XXXX and YYYY are the Unicode values of the characters you're looking for; for an emoji above U+FFFF, they are the high and low surrogates of its UTF-16 encoding (of course, change the regular expression to fit your needs).
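If working out the two values by hand is tedious, one option is to let the framework produce them. The sketch below assumes you know the code point and uses `char.ConvertFromUtf32` plus `Regex.Escape`; `CodePointPattern` and `ForCodePoint` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodePointPattern
{
    // Builds a regex that matches exactly one Unicode code point.
    // For code points above U+FFFF the literal is a surrogate pair.
    public static Regex ForCodePoint(int codePoint)
    {
        string literal = char.ConvertFromUtf32(codePoint);
        return new Regex(Regex.Escape(literal));
    }
}
```

For example, `CodePointPattern.ForCodePoint(0x1F600)` builds a pattern equivalent to `\uD83D\uDE00`.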

Cemafor