
I'm currently looking at regexes and emoji, and I'd like to use Unicode property escapes to simplify the task.

https://unicode.org/reports/tr18/#Full_Properties lists a number of emoji properties, such as Emoji and Emoji_Presentation.

Creating a regex using these patterns works:

const re = /\p{Emoji}/gu

The same page also lists RGI_Emoji, which is

The set of all emoji (characters and sequences) covered by ED-20, ED-21, ED-22, ED-23, ED-24, and ED-25.

That is, basic emoji, modifier sequences, etc., which seems to cover all the use cases I'm looking at.

However, creating a regex using this:

const re = /\p{RGI_Emoji}/gu

Gives a SyntaxError:

Uncaught SyntaxError: invalid property name in regular expression

The Unicode page does mention that

Properties marked with * are properties of strings, not just single code points.

and RGI_Emoji is marked with *. My knowledge of Unicode isn't amazing, so I'm not sure of the proper way to use this.

Is it possible to use RGI_Emoji in a regex, and if so, what's the correct way to use it?

divillysausages
  • Start from `\p{RI} \p{RI} | \p{Emoji} ( \p{EMod} | \x{FE0F} \x{20E3}? | [\x{E0020}-\x{E007E}]+ \x{E007F} )? (\x{200D} \p{Emoji} ( \p{EMod} | \x{FE0F} \x{20E3}? | [\x{E0020}-\x{E007E}]+ \x{E007F} )? )*` regex for _possible emoji_ at https://unicode.org/reports/tr51/#EBNF_and_Regex ? – JosefZ Jan 31 '22 at 18:15

2 Answers


RGI_Emoji is not available in JavaScript yet.

As noted at the top of the Full Properties table:

Properties marked with * are properties of strings, not just single code points.

Support for the following sequence properties is being proposed in proposal-regexp-unicode-sequence-properties. At the time of writing, the proposal is at stage 2, i.e. not yet part of the ECMAScript specification, so these properties are not available:

RGI_Emoji
Basic_Emoji
Emoji_Keycap_Sequence
RGI_Emoji_Modifier_Sequence
RGI_Emoji_Flag_Sequence
RGI_Emoji_Tag_Sequence
RGI_Emoji_ZWJ_Sequence
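Whether a particular engine accepts one of these properties can be probed at runtime instead of guessed. The following is a minimal sketch, not something taken from the proposal text: it assumes an engine that implements the `v` (unicodeSets) flag introduced in ECMAScript 2024, under which properties of strings are accepted. The helper names `supportsRgiEmoji` and `findRgiEmoji` are hypothetical.

```javascript
// Returns true if this engine accepts \p{RGI_Emoji} under the `v` flag.
// The pattern is built with the RegExp constructor so that an engine
// without support throws a catchable SyntaxError at runtime instead of
// failing when the script is parsed.
function supportsRgiEmoji() {
  try {
    new RegExp("\\p{RGI_Emoji}", "v");
    return true;
  } catch (err) {
    return false;
  }
}

// Returns every RGI emoji found in `text` as an array of strings,
// or null when the engine does not support the property.
function findRgiEmoji(text) {
  if (!supportsRgiEmoji()) return null;
  return text.match(new RegExp("\\p{RGI_Emoji}", "gv")) || [];
}
```

In an engine without support, `findRgiEmoji` returns `null`, which is the cue to fall back to a hand-built pattern.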

To confirm, check the \p{UnicodeBinaryPropertyName} values available in the latest ECMAScript specification. Only the following emoji-related properties of characters are available:

Emoji
Emoji_Component
EComp
Emoji_Modifier
EMod
Emoji_Modifier_Base
EBase
Emoji_Presentation
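As a quick illustration of how these character-level properties behave with the `u` flag, here is a small sketch (the `classify` helper is hypothetical). Note that `\p{Emoji}` is broader than the name suggests: ASCII digits, `#` and `*` have the Emoji property because they can start keycap sequences.

```javascript
// Character-level emoji properties that are valid with the `u` flag.
const emojiChar = /\p{Emoji}/u;
const emojiPresentation = /\p{Emoji_Presentation}/u;

// Reports which of the two properties a string's content matches.
// The regexes have no `g` flag, so test() keeps no state between calls.
function classify(ch) {
  return {
    emoji: emojiChar.test(ch),
    presentation: emojiPresentation.test(ch),
  };
}

console.log(classify("\u{1F600}")); // grinning face
console.log(classify("7"));         // ASCII digit
console.log(classify("a"));         // plain letter
```

For text that may contain digits, `\p{Emoji_Presentation}` or `\p{Extended_Pictographic}` is often a closer fit than `\p{Emoji}` on its own.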

You'll have to build a regular expression yourself from Unicode ranges covering the ED-20, ED-21, ED-22, ED-23, ED-24, and ED-25 sets, as suggested by @JosefZ in a comment.
This discussion may also help: JavaScript regular expression for Unicode emoji
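For reference, the _possible emoji_ regex from the comment can be translated to JavaScript roughly as follows. This is a sketch under a few assumptions: the `\x{...}` escapes are rewritten as `\u{...}`, the insignificant spaces are dropped, groups are made non-capturing, and the names `suffix`, `possibleEmoji` and `findPossibleEmoji` are made up for illustration. Because it is built on `\p{Emoji}`, it matches possible emoji, not only RGI ones, so plain digits, `#` and `*` match as well.

```javascript
// Optional tail after a base emoji: a skin-tone modifier, a variation
// selector (optionally followed by the keycap mark), or a tag sequence.
const suffix = String.raw`(?:\p{EMod}|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?`;

// Either a pair of regional indicators (a flag), or a base emoji with an
// optional tail, followed by any number of ZWJ-joined emoji.
const possibleEmoji = new RegExp(
  String.raw`\p{RI}\p{RI}|\p{Emoji}${suffix}(?:\u{200D}\p{Emoji}${suffix})*`,
  "gu"
);

// Returns all possible-emoji matches in `text` as an array of strings.
function findPossibleEmoji(text) {
  return text.match(possibleEmoji) || [];
}

// A ZWJ family sequence comes back as a single match.
console.log(findPossibleEmoji("hi \u{1F468}\u200D\u{1F469}\u200D\u{1F467}!").length);
```

If digits and symbols should be excluded, the results can be filtered afterwards, for example by dropping single-character matches that fail `\p{Emoji_Presentation}`.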

the Hutt

The emoji properties were added to UTS #18 relatively recently (mid-2020). This was a significant change to Unicode's property model, because it formally defined properties of strings for the first time. RGI_Emoji is a binary-valued property of strings of characters. A potential issue with using string properties in a regex is that the set corresponding to a string property can contain a vast number of strings. To avoid problems in existing implementations, UTS #18 allows the syntax \m{Property_Name} for string properties. See https://www.unicode.org/reports/tr18/#Resolving_Character_Ranges_with_Strings for more information.

It's possible that the implementation you're using has not been fully updated for Rev. 21 of UTS #18, with support for all new properties, or that it requires you to use the \m syntax for string properties.

The online Unicode UnicodeSet utility does support enumerating string results of a regex using the RGI_Emoji property:

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BRGI_Emoji%7D&g=&i=

Peter Constable