Why do Unicode emoji property escapes match numbers?

Question

I found this awesome way to detect emojis using a regex that doesn't use "huge magic ranges" by using a Unicode property escape:

console.log(/\p{Emoji}/u.test('flowers 🌼🌺🌸')) // true
console.log(/\p{Emoji}/u.test('flowers')) // false

But when I shared this knowledge in this answer, @Bronzdragon noticed that \p{Emoji} also matches numbers! Why is that? Numbers are not emojis?

console.log(/\p{Emoji}/u.test('flowers 123')) // unexpectedly true

// regex-only workaround by @Bronzdragon
const regex = /(?=\p{Emoji})(?!\p{Number})/u;
console.log(
  regex.test('flowers'), // false, as expected
  regex.test('flowers 123'), // false, as expected
  regex.test('flowers 123 🌼🌺🌸'), // true, as expected
  regex.test('flowers 🌼🌺🌸'), // true, as expected
)

// more readable workaround
const hasEmoji = str => {
  const nbEmojiOrNumber = (str.match(/\p{Emoji}/gu) || []).length;
  const nbNumber = (str.match(/\p{Number}/gu) || []).length;
  return nbEmojiOrNumber > nbNumber;
}
console.log(
  hasEmoji('flowers'), // false, as expected
  hasEmoji('flowers 123'), // false, as expected
  hasEmoji('flowers 123 🌼🌺🌸'), // true, as expected
  hasEmoji('flowers 🌼🌺🌸'), // true, as expected
)

Note that the workaround also fails for '123 flowers 🌼' for example - that *should* return true, as it definitely has emoji. — Jon Skeet, Oct 16 '20 at 12:38
The question is not how to fix it ([here is a fix](https://stackoverflow.com/a/48148218/3832970)), the question is **why**. Else, let's close it. — Wiktor Stribiżew, Oct 16 '20 at 12:43
@WiktorStribiżew indeed, I am asking **why**, also I don't want to use one of these range-based regex because they're extremely long, unreadable, magic, and not resilient to the adding of new emojis — Nino Filiu, Oct 16 '20 at 13:18
I think the answer is [here](https://github.com/mathiasbynens/emoji-regex/issues/33#issuecomment-373674579) and all thread after that post. *This is not a bug. `#` and `0-9` are `Emoji` characters with a text representation by default, per the Unicode Standard.* — Wiktor Stribiżew, Oct 16 '20 at 13:25
[This post](https://github.com/mathiasbynens/emoji-regex/issues/33#issuecomment-374176872) goes into more detail and you probably can use the `/\p{Extended_Pictographic}/u` regex to match emojis except for some keycap base characters that are still emojis. — Wiktor Stribiżew, Oct 16 '20 at 13:35

Wiktor Stribiżew · Accepted Answer · 2020-10-17T17:15:21.770

According to this post, digits, #, and * have the Emoji property set to Yes, which means digits are considered valid emoji chars. Together with ZWJ and some more chars, they are also listed as Emoji_Component, the building blocks of emoji sequences:

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] ()        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (..)      tag space..cancel tag

For example, 1 is a digit, but it becomes the emoji 1️⃣ when followed by the U+FE0F and U+20E3 chars:

console.log("1\uFE0F\u20E3 2\uFE0F\u20E3 3\uFE0F\u20E3 4\uFE0F\u20E3 5\uFE0F\u20E3 6\uFE0F\u20E3 7\uFE0F\u20E3 8\uFE0F\u20E3 9\uFE0F\u20E3 0\uFE0F\u20E3")
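To see which emoji-related properties a bare digit carries, a quick check like the one below may help. It is only an illustrative sketch, not part of the original demo: the `props` helper name is made up here, and the exact results depend on the Unicode version your JavaScript engine ships with.

```javascript
// Hypothetical helper (name chosen for this sketch): report which
// emoji-related Unicode binary properties match somewhere in `str`.
const props = (str) => ({
  Emoji: /\p{Emoji}/u.test(str),
  Emoji_Component: /\p{Emoji_Component}/u.test(str),
  Emoji_Presentation: /\p{Emoji_Presentation}/u.test(str),
  Extended_Pictographic: /\p{Extended_Pictographic}/u.test(str),
});

// A bare digit: Emoji and Emoji_Component are true, while
// Emoji_Presentation and Extended_Pictographic are false.
console.log(props('1'));

// U+1F33C (blossom), written as an escape: Emoji, Emoji_Presentation and
// Extended_Pictographic are true, Emoji_Component is false.
console.log(props('\u{1F33C}'));
```

In other words, a digit is flagged as Emoji because it can start a keycap sequence, but it has no emoji presentation on its own.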

If you want to avoid matching digits, use the Extended_Pictographic Unicode property:

The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Components.

So, you may use /\p{Extended_Pictographic}/gu to match all emojis proper, /\p{Extended_Pictographic}/u to test for a single emoji proper, or /[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/u to match emojis proper plus the skin tone chars (light skin tone to dark skin tone) and the hair chars (red-haired to white-haired):

const regex_emoji = /[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/u;
console.log( regex_emoji.test('flowers 123') );     // => false
console.log( regex_emoji.test('flowers 🌼🌺🌸') ); // => true
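If keycap emojis such as 1️⃣ should also count while bare digits should not, one possible extension is to add an alternative for the keycap sequence itself. This is a sketch under that assumption; the names `emojiOrKeycap` and `containsEmoji` are invented for illustration.

```javascript
// Assumption for this sketch: "emoji" means a pictographic character,
// a skin tone or hair component, or a keycap sequence
// (base char #, * or 0-9, an optional U+FE0F, then U+20E3).
const emojiOrKeycap =
  /[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]|[#*0-9]\uFE0F?\u20E3/u;

// No `g` flag, so .test() keeps no state between calls.
const containsEmoji = (str) => emojiOrKeycap.test(str);

console.log(containsEmoji('flowers 123'));         // false
console.log(containsEmoji('press 1\uFE0F\u20E3')); // true (keycap 1)
console.log(containsEmoji('flowers \u{1F33C}'));   // true (blossom)
```

Flags built from regional indicator symbols are still not covered by this pattern, because those characters are not Extended_Pictographic.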

Thanks for the effort on the answer, if you could tell me **why** the Unicode Consortium considers 0123456789#* as emojis that'd be perfect! — Nino Filiu, Oct 17 '20 at 15:27
@NinoFiliu I added a demo showing how `1` turns into an emoji. — Wiktor Stribiżew, Oct 17 '20 at 17:15
Note that if you use this regex to remove emojis from strings (e.g. `'❌‍♀️‍♂️'.replace(/[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/gu, '')`), there will be some leftover characters in the string (above resulting string has length 4). For this use case, I ended up using the [`emoji-regex`](https://www.npmjs.com/package/emoji-regex) npm package to match them. — Thai, Mar 07 '23 at 06:51
@Thai I have an all-embracing regex for [Emojis V14.0](https://stackoverflow.com/a/48148218/3832970), but I need to update it for the [current 15.1](https://unicode.org/Public/draft/emoji/emoji-test.txt). This answer is more about the `\p{Emoji}` construct. People just freak out when they see long regex patterns, so I tried to come up with something based on the Unicode category classes that is short and good enough. — Wiktor Stribiżew, Mar 07 '23 at 08:18
@WiktorStribiżew I agree, I think your solution is short and good enough for checking for the presence of emojis in a string. Furthermore, your answer is relevant to the question (you answered why) while mine isn’t (I suggested an npm package for a particular use case). However, I added the comment above here because this StackOverflow post came up first on Google when I try to debug the problem where `.replace(/\p{Emoji}/gu, '')` deleted the numbers. — Thai, Mar 08 '23 at 09:13
one way to think about this is that `\p{Emoji}` means "can this ever be part of an emoji" not "is this always an emoji". so it would be useful for eg checking whether a string is _entirely_ composed of emoji — MalcolmOcean, Aug 30 '23 at 13:59

score 3 · Answer 2 · answered May 19 '23 at 13:00

One of the problems with using \p{Emoji} is that Unicode defines Emoji as a character property, meaning it only captures individual characters or code points. As a result, \p{Emoji} might seem to solve your problem as long as you only test it against single-code point emoji such as 🫱 (U+1FAF1), but that’s misleading.

The vast majority of emoji defined by Unicode consist of multiple code points, and thus cannot be matched by a single \p{Emoji}. For example: 🫱🏿‍🫲🏻 (U+1FAF1 U+1F3FF U+200D U+1FAF2 U+1F3FB).

const reEmojiCharacter = /^\p{Emoji}$/u;
reEmojiCharacter.test('🫱'); // → true
reEmojiCharacter.test('🫱🏿‍🫲🏻'); // → false
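To make the "multiple code points" point concrete, the sequence can be split into its code points. This is a small illustrative sketch; `codePoints` and `handshake` are names made up for it, and the string is written with escapes so the invisible zero width joiner is explicit.

```javascript
// Split a string into code points (not UTF-16 code units) and
// format each one as U+XXXX.
const codePoints = (str) =>
  Array.from(str, (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase());

// The multi-code-point example from above, spelled out with escapes.
const handshake = '\u{1FAF1}\u{1F3FF}\u{200D}\u{1FAF2}\u{1F3FB}';

console.log(codePoints(handshake));
// → U+1FAF1, U+1F3FF, U+200D, U+1FAF2, U+1F3FB

console.log(Array.from(handshake).length); // 5
```

Since the anchored pattern allows exactly one code point between ^ and $, a five-code-point string cannot match it.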

Luckily, Unicode defines several properties of strings, which — you guessed it — are not restricted to just 1 code point at a time. The property of strings called RGI_Emoji includes all emoji that are officially recommended for general interchange, and is likely what you really want instead of Emoji.

In JavaScript regular expressions, you can use properties of strings when enabling the v flag.

const reEmoji = /^\p{RGI_Emoji}$/v;
reEmoji.test('🫱'); // → true
reEmoji.test('🫱🏿‍🫲🏻'); // → true
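Because the v flag is fairly recent, a literal like the one above is a syntax error in older engines. One way to guard against that, sketched here with made-up names (`buildEmojiPattern`, `stripEmoji`), is to build the pattern through the RegExp constructor and fall back to a character-based pattern when the flag is rejected.

```javascript
// Engines without the `v` flag throw a SyntaxError from the RegExp
// constructor, so the pattern is built inside try/catch.
// The fallback pattern is an assumption of this sketch and is less
// precise: it works per code point, not per emoji sequence.
const buildEmojiPattern = () => {
  try {
    return new RegExp('\\p{RGI_Emoji}', 'gv');
  } catch {
    return /[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}]/gu;
  }
};

const emojiPattern = buildEmojiPattern();

// Hypothetical helper: remove emoji while leaving digits untouched.
const stripEmoji = (str) => str.replace(emojiPattern, '');

console.log(stripEmoji('flowers 123 \u{1F33C}')); // 'flowers 123 '
```

With the fallback pattern, joiners and variation selectors inside longer sequences would be left behind, so the RGI_Emoji path is the one to prefer where it is available.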

Nice catch! I added your answer as a "see also" link of [this answer](https://stackoverflow.com/a/64007175/8186898) — Nino Filiu, May 22 '23 at 08:42
