Note: this question could look odd on systems not supporting the included emoji.
This is a follow-up to the question "How do I remove emoji from string".
I want to build a regular expression that matches all emoji that can be entered in Mac OS X / iOS.
The obvious Unicode blocks cover most, but not all of these emoji:
- U+1F300..U+1F5FF Miscellaneous Symbols And Pictographs
- U+1F600..U+1F64F Emoticons
- U+1F650..U+1F67F Ornamental Dingbats
- U+1F680..U+1F6FF Transport and Map Symbols
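For reference, covering just these four blocks is the easy part. A minimal sketch in Python 3 (the names `BLOCKS` and `in_blocks` are made up for illustration):

```python
import re

# Character class spanning the four blocks listed above.
BLOCKS = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols And Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F650-\U0001F67F"  # Ornamental Dingbats
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "]"
)

def in_blocks(text):
    """Return True if text contains a character from one of the four blocks."""
    return BLOCKS.search(text) is not None

print(in_blocks("\U0001F600"))    # Grinning Face is inside the Emoticons block
print(in_blocks("\u263A\uFE0F"))  # White Smiling Face + VS16 lies outside all four
```

The second call is the problem case: U+263A sits in Miscellaneous Symbols (U+2600..U+26FF), so a block-only pattern does not catch it.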
Wikipedia provides a compiled list of all the symbols available in Apple Color Emoji on OS X Mountain Lion and iOS 6, which looks like a good starting point. Here is a slightly updated version:
people = '☺️✨✊✌✋☝❤'
nature = '⭐☀⛅☁⚡☔❄⛄'
objects = '☎⏳⌛⏰⌚✉✂✒✏⚽⚾⛳☕'
places = '⛪⛺⛲⛵⚓✈⚠⛽♨'
symbols = '1️⃣2️⃣3️⃣4️⃣5️⃣6️⃣7️⃣8️⃣9️⃣0️⃣#️⃣⬆️⬇️⬅️➡️↗️↖️↘️↙️↔️↕️◀️▶️↩️↪️ℹ️⏪⏩⏫⏬⤵️⤴️️♿️Ⓜ️㊙️㊗️⛔✳️❇️❎✅✴️➿♻️♈️♉️♊️♋️♌️♍️♎️♏️♐️♑️♒️♓️⛎©️®️™️❌‼️⁉️❗❓❕❔⭕✖️➕➖➗♠♥♣♦✔☑➰〰〽️◼️◻️◾️◽️▪️▫️⚫️⚪️⬜️⬛️'
emoji = people + nature + objects + places + symbols # all emoji combined
Most characters have a single code point and converting these would be easy:
- U+1F600 (Grinning Face)
But some characters are "encoded using two Unicode values":
- ☺️ U+263A U+FE0F (White Smiling Face, Variation Selector 16)
- U+1F1EF U+1F1F5 (Regional Indicator Symbol Letter J / Regional Indicator Symbol Letter P)
- ⬛️ U+2B1B U+FE0F (Black Large Square / Variation Selector 16)
And some even have three code points:
- #️⃣ U+0023 U+FE0F U+20E3 (Number Sign / Variation Selector 16 / Combining Enclosing Keycap)
(Variation Selector 16, U+FE0F, requests "emoji style" presentation of the preceding character.)
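To make the sequences above easier to inspect, here is a small Python 3 sketch that lists the code points and Unicode names of a string. The helper names `code_points` and `names` are hypothetical; `unicodedata.name` is from the standard library.

```python
import unicodedata

def code_points(text):
    """Format each code point in text as U+XXXX."""
    return ["U+%04X" % ord(ch) for ch in text]

def names(text):
    """Look up the Unicode name of each code point (placeholder if unnamed)."""
    return [unicodedata.name(ch, "<unnamed>") for ch in text]

smiley = "\u263A\uFE0F"               # White Smiling Face + VS16
keycap = "\u0023\uFE0F\u20E3"         # Number Sign + VS16 + keycap
flag_jp = "\U0001F1EF\U0001F1F5"      # Regional Indicators J + P

for sample in (smiley, keycap, flag_jp):
    print(code_points(sample), names(sample))
```

Iterating over a Python 3 string yields code points, not rendered characters, which is why the keycap sample produces three entries.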
How can I split this list into characters (without splitting combined characters), find their code point(s) and finally build a regular expression matching them?
The regex doesn't have to account for "missing" characters within larger blocks, i.e. it's fine if the four Unicode blocks mentioned above are covered entirely.
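A rough sketch of one possible direction, in Python 3, assuming that only the three combining patterns shown above occur (a base character optionally followed by U+FE0F and/or U+20E3, or a pair of regional indicators). This is a simplification, not full grapheme cluster segmentation, and `split_clusters` and `build_pattern` are made-up names:

```python
import re

# One "cluster": a pair of regional indicators, or any single character
# optionally followed by Variation Selector 16 and a combining keycap.
CLUSTER = re.compile(
    "[\U0001F1E6-\U0001F1FF]{2}"
    "|.\uFE0F?\u20E3?",
    re.DOTALL,
)

def split_clusters(text):
    """Split text into clusters without separating combined sequences."""
    return CLUSTER.findall(text)

def build_pattern(clusters):
    """Build an alternation regex matching any of the given clusters."""
    # Longest first, so a multi-code-point sequence wins over its prefix.
    unique = sorted(set(clusters), key=len, reverse=True)
    return re.compile("|".join(re.escape(c) for c in unique))

sample = "\u263A\uFE0F\u2728#\uFE0F\u20E3\U0001F1EF\U0001F1F5"
clusters = split_clusters(sample)
pattern = build_pattern(clusters)
print(["U+%04X" % ord(ch) for c in clusters for ch in c])
print(pattern.sub("", "a\u263A\uFE0Fb#\uFE0F\u20E3c"))
```

One caveat with this simplification: a stray U+FE0F that does not directly follow a base character ends up as a cluster of its own and would need to be filtered out.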
(I'm going to answer this myself if I don't get any answers, but maybe there's an easy solution)