Note: this question could look odd on systems not supporting the included emoji.

This is a follow-up question to How do I remove emoji from string.

I want to build a regular expression that matches all emoji that can be entered in Mac OS X / iOS.

The obvious Unicode blocks cover most, but not all of these emoji:

Wikipedia provides a compiled list of all the symbols available in Apple Color Emoji on OS X Mountain Lion and iOS 6, which looks like a good starting point: (slightly updated)

people  = '☺️✨✊✌✋☝❤'
nature  = '⭐☀⛅☁⚡☔❄⛄'
objects = '☎⏳⌛⏰⌚✉✂✒✏⚽⚾⛳☕'
places  = '⛪⛺⛲⛵⚓✈⚠⛽♨'
symbols = '1️⃣2️⃣3️⃣4️⃣5️⃣6️⃣7️⃣8️⃣9️⃣0️⃣#️⃣⬆️⬇️⬅️➡️↗️↖️↘️↙️↔️↕️◀️▶️↩️↪️ℹ️⏪⏩⏫⏬⤵️⤴️️♿️Ⓜ️㊙️㊗️⛔✳️❇️❎✅✴️➿♻️♈️♉️♊️♋️♌️♍️♎️♏️♐️♑️♒️♓️⛎©️®️™️❌‼️⁉️❗❓❕❔⭕✖️➕➖➗♠♥♣♦✔☑➰〰〽️◼️◻️◾️◽️▪️▫️⚫️⚪️⬜️⬛️'

emoji = people + nature + objects + places + symbols # all emoji combined

Most characters have a single code point and converting these would be easy:

  • 😀 U+1F600 (Grinning Face)

But some characters are "encoded using two Unicode values":

  • ☺️ U+263A U+FE0F (White Smiling Face, Variation Selector 16)
  • 🇯🇵 U+1F1EF U+1F1F5 (Regional Indicator Symbol Letter J / Regional Indicator Symbol Letter P)
  • ⬛️ U+2B1B U+FE0F (Black Large Square / Variation Selector 16)

And some even consist of three code points:

  • #️⃣ U+0023 U+FE0F U+20E3 (Number Sign / Variation Selector 16 / Combining Enclosing Keycap)

(Variation Selector 16 means "emoji style")
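For reference, the sequences listed above can be inspected in Ruby with `String#codepoints`. This is only a small sketch; the helper name `code_points` is made up for illustration.

```ruby
# Hypothetical helper: list a string's code points in U+XXXX notation.
def code_points(str)
  str.codepoints.map { |cp| format('U+%04X', cp) }
end

p code_points("\u{1F600}")           # => ["U+1F600"]
p code_points("\u{263A}\u{FE0F}")    # => ["U+263A", "U+FE0F"]
p code_points("\u{1F1EF}\u{1F1F5}")  # => ["U+1F1EF", "U+1F1F5"]
p code_points("#\u{FE0F}\u{20E3}")   # => ["U+0023", "U+FE0F", "U+20E3"]
```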

How can I split this list into characters (without splitting combined characters), find their code point(s) and finally build a regular expression matching them?

The regex doesn't have to respect "missing" characters within larger blocks, i.e. it's okay if the 4 Unicode blocks mentioned above are entirely covered.

(I'm going to answer this myself if I don't get any answers, but maybe there's an easy solution)

Stefan
  • Wait? Does that mean I could theoretically (if the font has a glyph for it) construct *any* flag just by using the ISO 3166-1 ALPHA-2 code? Like, Regional Indicator Symbol Letter D / Regional Indicator Symbol Letter E for Germany (ISO 3166-1 ALPHA-2 code: de). – Jörg W Mittag Jul 11 '14 at 13:11
  • @JörgWMittag it depends, from http://www.unicode.org/charts/PDF/U1F100.pdf: *"These characters can be used in pairs to represent regional codes. In some emoji implementations, certain pairs may be recognized and displayed by alternate means; for instance, an implementation might recognize F + R and display this combination with a symbol representing the flag of France."* – Stefan Jul 11 '14 at 13:20
  • could I ask how the second one is two unicode values: `U+1F1EF U+1F1F5 (Regional Indicator Symbol Letter J / Regional Indicator Symbol Letter P)` isn't that 2 symbols (that's what it looks like on my computer)? – Mike H-R Jul 11 '14 at 14:42
  • @MikeH-R yes, it's J and P, but it's displayed as a single symbol (flag of Japan) on Mac OS X / iOS – Stefan Jul 11 '14 at 14:48
  • ah, I get separate symbols on Arch Linux Gnome. For the record. It looks like the symbols that have multiple codepoints are in the minority (it appears less than 20). As far as I can tell there is only one in the first case that is split into 2 characters after splitting with: `people.each_char.map {|x| x}`. So from a practical standpoint if you can get the number of symbols that should be in each list, then you can compare the size and check if there are any codepoints that are always part of a double (like `U+FE0F`) and use that as your list. it should then be obvious what to do. (Hopefully) – Mike H-R Jul 11 '14 at 14:54
  • Obviously this could be wrong as my computer doesn't seem to render some of them correctly, if so just point that out and I'll delete the comment. :) – Mike H-R Jul 11 '14 at 14:55
  • Also, reading [this FAQ](http://www.unicode.org/faq/vs.html) and looking [here](http://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt) it looks like apple may be using non-standard variants which would probably make it more difficult for you I'm afraid. (the U+1F1F5 codepoint is not a standardised variant). – Mike H-R Jul 11 '14 at 15:02
  • @MikeH-R have you seen my reply to JörgWMittag above regarding the flags? The PDF mentions "pair" usage, so it seems to be a common implementation (or maybe just Apple's implementation). – Stefan Jul 11 '14 at 15:34
  • Ahhh, I had missed that. – Mike H-R Jul 11 '14 at 15:36
  • Is there any regex that also takes into account the textual variation selector https://codepoints.net/U+FE0E?lang=en? So it wouldn't match an emoji when succeeded by it? – ragurney Jan 17 '20 at 22:47

2 Answers

The upcoming Unicode Emoji data files would help with this. At the moment these are still drafts, but they might still help you out.

By parsing http://www.unicode.org/Public/emoji/1.0/emoji-data.txt you could quite easily get a list of all emoji in the Unicode standard. (Note that some of these emoji consist of multiple code points.) Once you have such a list, it’s trivial to turn it into a regular expression.

Here’s a JavaScript version: https://github.com/mathiasbynens/emoji-regex/blob/master/index.js

And here’s the script that generates it based on the data from emoji-data.txt: https://github.com/mathiasbynens/emoji-regex/blob/master/scripts/generate-regex.js
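Since the question is about Ruby, the same idea might look roughly like the sketch below. It is not a port of the linked scripts. It assumes each data line starts with one or more space-separated hexadecimal code points, followed by `;`-separated fields and an optional `#` comment. The sample data and the names `emoji_sequences` and `emoji_regex` are made up for illustration, and the real file layout should be checked before relying on this.

```ruby
# Hypothetical sample in the ASSUMED "code points ; fields # comment" layout.
SAMPLE = <<~DATA
  # comment lines and blank lines are skipped

  00A9 ; text ; L1 ; none ; j # COPYRIGHT SIGN
  263A ; emoji ; L1 ; secondary ; j # WHITE SMILING FACE
  0023 20E3 ; emoji ; L1 ; none ; j # keycap NUMBER SIGN
  1F600 ; emoji ; L1 ; secondary ; j # GRINNING FACE
DATA

# Turn each data line into a string built from its code points.
def emoji_sequences(data)
  data.each_line.map do |line|
    field = line.sub(/#.*/, '').split(';').first.to_s.strip
    next if field.empty?
    field.split.map { |hex| hex.to_i(16) }.pack('U*')
  end.compact
end

# Longest sequences first, so multi-code-point emoji win over their prefixes.
def emoji_regex(sequences)
  Regexp.union(sequences.sort_by { |s| -s.length })
end

EMOJI = emoji_regex(emoji_sequences(SAMPLE))
p "a\u{A9}b#\u{20E3}c\u{1F600}".scan(EMOJI).length  # => 3
```

`Regexp.union` escapes each string, so characters such as `#` need no special handling. A character class with ranges would be more compact, but an alternation keeps multi-code-point sequences intact.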

Mathias Bynens

This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:

[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]

Examples can be found here: https://stackoverflow.com/a/29115920/1911674
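A brief usage sketch in Ruby, using only a small excerpt of the character class above for readability. The constant name `EMOJI_EXCERPT` and the sample text are made up here; the full class would be used the same way.

```ruby
# Excerpt of the character class above, wrapped in a Ruby regex literal.
EMOJI_EXCERPT = /[\u{203C}\u{2049}\u{2600}-\u{2601}\u{263A}\u{1F300}-\u{1F320}\u{1F5FB}-\u{1F640}]/

text = "Sunny \u{2600} and happy \u{1F600}!"

found   = text.scan(EMOJI_EXCERPT)      # every matched emoji, in order
cleaned = text.gsub(EMOJI_EXCERPT, '')  # the text with matches removed

p found.length  # => 2
p cleaned       # => "Sunny  and happy !"
```

Note that U+FE0F (Variation Selector 16) is not part of the class as written, so for a sequence like U+263A U+FE0F a `gsub` with it would leave the selector behind.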

EDIT: I updated the regex to exclude ASCII numbers and symbols. See comments from How do I remove emoji from string for details.

franklsf95