I use this regular expression to remove all possible emojis from a string.

/(\x{00a9}|\x{00ae}|[\x{2000}-\x{3300}]|\x{d83c}[\x{d000}-\x{dfff}]|\x{d83d}[\x{d000}-\x{dfff}]|\x{d83e}[\x{d000}-\x{dfff}])/u

but it throws this exception:

preg_replace(): Compilation failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 46

I searched for this error but couldn't find an accurate answer. I would appreciate it if someone could tell me exactly what this error means and how to fix it.

Also what is this:

>= 0xd800 && <= 0xdfff

The regex above is a PCRE version of this source:

https://www.regextester.com/106421

HosSeinM
  • Removing all possible emojis is not going to be feasible. What if a new emoji is added in future? Check out this post: https://stackoverflow.com/a/48946207/1839439 – Dharman Nov 05 '19 at 17:50
  • @Dharman Yes. I want a regex to remove all possible emojis that cause problems for Excel export. The source regex, which works properly in JavaScript, is included at the end of the question. – HosSeinM Nov 05 '19 at 17:51
  • Working for JavaScript is the keyword here. Remember that JS is broken and can't handle Unicode properly. The regex which you are using will not work in PHP, which considers Unicode code points, not UTF-16 code units. The surrogate pairs you are trying to match against are illegal in PHP. – Dharman Nov 05 '19 at 17:53
  • Maybe this would help https://stackoverflow.com/a/51951236/1839439 or this https://stackoverflow.com/q/35961245/1839439 or this https://stackoverflow.com/a/20208095/1839439 – Dharman Nov 05 '19 at 17:55
  • Honestly I would try to figure out how to not remove the emojis. It should be the easier solution to make the "excel reporting" work with emojis instead of filtering them out. – Dharman Nov 05 '19 at 17:58
  • Thanks @Dharman. At this moment I can only change previous regex to support new emojis (since 2018 I think). – HosSeinM Nov 05 '19 at 18:01
  • @Dharman Referring to this link https://stackoverflow.com/questions/51947319/php-how-to-match-a-range-of-unicode-paired-surrogates-emoticons-emoji/51951236#51951236, how can I get a PHP that ships with a PCRE build for UTF-16? – HosSeinM Nov 05 '19 at 18:04
  • So you just need to convert the pattern to the PHP PCRE one? What exact ranges of code points do you want to match? `\u{d83c}\u{d000}` is illegal, what is that supposed to match? – Wiktor Stribiżew Nov 05 '19 at 18:15
  • @WiktorStribiżew Yes I want. Why is it illegal? It would be used to match all possible emojis. – HosSeinM Nov 05 '19 at 18:43
  • Because this code point does not exist. Aha, so you want to match all emojis? v12.1? – Wiktor Stribiżew Nov 05 '19 at 18:46
  • @WiktorStribiżew So how does it exist in JS? – HosSeinM Nov 05 '19 at 19:01
  • It does not exist in JS either. In JS the two UTF-16 code units are matched independently; in PHP they would have to be joined into a single code point, and that is where the failure occurs (JS does not try to do that). – Wiktor Stribiżew Nov 05 '19 at 19:02
  • Aha. Thank you @WiktorStribiżew. Could you edit this regex so it can be used with PCRE, or should I search for some other regex? – HosSeinM Nov 05 '19 at 19:14
  • @HosSeinM This regex cannot be converted. As I understand it, you just need a new regex that will match all emojis as defined in the [Unicode Emoji v12.1 standard](http://unicode.org/emoji/charts/full-emoji-list.html). – Wiktor Stribiżew Nov 05 '19 at 19:16
  • @WiktorStribiżew I'll do it. Thank u anyway. – HosSeinM Nov 05 '19 at 20:20
  • I will be able to help you tomorrow. If you get stuck, drop a comment with my @username to let me know. – Wiktor Stribiżew Nov 05 '19 at 20:33

3 Answers

Emojis are specified in UTS #51. The property \p{Emoji} should work, but doesn't.

Do it the hard way. Parse emoji-*.txt:

perl -C -lne'
    if (my ($c) = $_ =~ /^((?:(?:[[:xdigit:]]+ )|[[:xdigit:]]+\.\.)[[:xdigit:]]+)/) {
        if ($c =~ /\.\./) { # ranges
             my ($f, $t) = map { hex } split /\.\./, $c;
             print for map { chr } $f..$t;
        } else { # sequences
             print join "", map { chr hex } split /\s+/, $c;
        }
    }
' emoji-*.txt

This gives us a newline-separated list of all emojis. Using Regexp::Assemble::Compressed, the result is

(?:[\x{23EB}\x{23EC}\x{23F0}\x{2605}\x{2607}-\x{260D}\x{260F}\x{2610}\x{2612}\x{2616}\x{2617}\x{261A}-\x{261C}\x{261E}\x{261F}\x{2621}\x{2624}\x{2625}\x{2627}-\x{2629}\x{262B}-\x{262D}\x{2630}-\x{2637}\x{263B}-\x{263F}\x{2641}\x{2643}-\x{2647}\x{2654}-\x{265E}\x{2661}\x{2662}\x{2664}\x{2667}\x{2669}-\x{267A}\x{267C}\x{267D}\x{2680}-\x{2685}\x{2690}\x{2691}\x{2698}\x{269A}\x{269E}\x{269F}\x{26A2}-\x{26A9}\x{26AC}-\x{26AF}\x{26B3}-\x{26BC}\x{26BF}-\x{26C3}\x{26C6}\x{26C7}\x{26C9}-\x{26CD}\x{26D0}\x{26D2}\x{26D5}-\x{26E1}\x{26E4}-\x{26E8}\x{26EB}-\x{26EF}\x{26F6}\x{26FB}\x{26FC}\x{26FE}\x{26FF}\x{2701}\x{2703}\x{2704}\x{270E}\x{2710}\x{2711}\x{2754}\x{2755}\x{2765}-\x{2767}\x{2795}-\x{2797}\x{1F000}-\x{1F003}\x{1F005}-\x{1F0BE}\x{1F0C1}-\x{1F0CF}\x{1F0D1}-\x{1F0FF}\x{1F10D}-\x{1F10F}\x{1F16D}-\x{1F16F}\x{1F191}-\x{1F19A}\x{1F1AD}-\x{1F1E5}\x{1F201}\x{1F203}-\x{1F20F}\x{1F232}-\x{1F236}\x{1F238}-\x{1F23A}\x{1F23C}-\x{1F23F}\x{1F249}-\x{1F30C}\x{1F310}-\x{1F314}\x{1F316}-\x{1F31B}\x{1F31D}-\x{1F320}\x{1F322}\x{1F323}\x{1F32D}-\x{1F335}\x{1F337}-\x{1F377}\x{1F379}-\x{1F37C}\x{1F37E}-\x{1F384}\x{1F386}-\x{1F392}\x{1F394}\x{1F395}\x{1F398}\x{1F39C}\x{1F39D}\x{1F3A0}-\x{1F3A6}\x{1F3A8}-\x{1F3AB}\x{1F3AF}-\x{1F3C1}\x{1F3C8}\x{1F3C9}\x{1F3CF}-\x{1F3D3}\x{1F3E1}-\x{1F3EC}\x{1F3EE}-\x{1F3F2}\x{1F3F6}\x{1F3F8}-\x{1F407}\x{1F409}-\x{1F414}\x{1F416}-\x{1F41E}\x{1F420}-\x{1F425}\x{1F427}-\x{1F43E}\x{1F444}\x{1F445}\x{1F451}\x{1F452}\x{1F454}-\x{1F465}\x{1F479}-\x{1F47B}\x{1F47E}-\x{1F480}\x{1F484}\x{1F488}-\x{1F4A2}\x{1F4A4}-\x{1F4A9}\x{1F4AB}-\x{1F4AF}\x{1F4B1}\x{1F4B2}\x{1F4B4}-\x{1F4BA}\x{1F4BC}-\x{1F4BE}\x{1F4C0}-\x{1F4CA}\x{1F4CC}-\x{1F4D9}\x{1F4DB}-\x{1F4DE}\x{1F4E0}-\x{1F4E3}\x{1F4E7}-\x{1F4E9}\x{1F4EE}-\x{1F4F6}\x{1F4FC}\x{1F4FE}\x{1F500}-\x{1F507}\x{1F509}-\x{1F50C}\x{1F50E}-\x{1F511}\x{1F514}-\x{1F53D}\x{1F546}-\x{1F548}\x{1F54B}-\x{1F54F}\x{1F568}-\x{1F56E}\x{1F571}\x{1F572}\x{1F57B}-\x{1F586}\x{1F588}\x{1F589}\x{1F58E}\x{1F58F}\x{1F591}-\x{1F594}\x{1F597}-\x{1F5A3}\x{1F5
A6}\x{1F5A7}\x{1F5A9}-\x{1F5B0}\x{1F5B3}-\x{1F5BB}\x{1F5BD}-\x{1F5C1}\x{1F5C5}-\x{1F5D0}\x{1F5D4}-\x{1F5DB}\x{1F5DF}\x{1F5E0}\x{1F5E2}\x{1F5E4}-\x{1F5E7}\x{1F5E9}-\x{1F5EE}\x{1F5F0}-\x{1F5F2}\x{1F5F4}-\x{1F5F9}\x{1F5FB}-\x{1F5FF}\x{1F601}-\x{1F60F}\x{1F612}-\x{1F614}\x{1F61C}-\x{1F61E}\x{1F620}-\x{1F62B}\x{1F62E}-\x{1F633}\x{1F635}-\x{1F644}\x{1F648}-\x{1F64A}\x{1F680}-\x{1F686}\x{1F688}-\x{1F68C}\x{1F68E}-\x{1F690}\x{1F692}\x{1F693}\x{1F695}-\x{1F697}\x{1F699}-\x{1F6A2}\x{1F6A4}-\x{1F6AC}\x{1F6AE}-\x{1F6B1}\x{1F6B3}\x{1F6B7}\x{1F6B8}\x{1F6BB}\x{1F6BD}-\x{1F6BF}\x{1F6C1}-\x{1F6CA}\x{1F6D1}-\x{1F6D4}\x{1F6D6}-\x{1F6DF}\x{1F6E6}-\x{1F6E8}\x{1F6EA}-\x{1F6EF}\x{1F6F1}\x{1F6F2}\x{1F6F4}-\x{1F6F8}\x{1F6FB}-\x{1F6FF}\x{1F774}-\x{1F77F}\x{1F7D5}-\x{1F7FF}\x{1F80C}-\x{1F80F}\x{1F848}-\x{1F84F}\x{1F85A}-\x{1F85F}\x{1F888}-\x{1F88F}\x{1F8AE}-\x{1F8FF}\x{1F90D}\x{1F90E}\x{1F910}-\x{1F917}\x{1F91D}\x{1F920}-\x{1F925}\x{1F927}-\x{1F92F}\x{1F93A}\x{1F940}-\x{1F945}\x{1F947}-\x{1F94B}\x{1F94D}-\x{1F970}\x{1F973}-\x{1F979}\x{1F97C}-\x{1F9B4}\x{1F9B7}\x{1F9BA}\x{1F9BC}-\x{1F9BF}\x{1F9C1}-\x{1F9CC}\x{1F9D0}\x{1F9E0}-\x{1FFFD}\x{E0020}-\x{E007F}]|\x{1F1F2}[\x{1F1E6}\x{1F1E8}-\x{1F1ED}\x{1F1F0}-\x{1F1FF}]?|\x{1F1E7}[\x{1F1E6}\x{1F1E7}\x{1F1E9}-\x{1F1EF}\x{1F1F1}-\x{1F1F4}\x{1F1F6}-\x{1F1F9}\x{1F1FB}\x{1F1FC}\x{1F1FE}\x{1F1FF}]?|\x{1F1F8}[\x{1F1E6}-\x{1F1EA}\x{1F1EC}-\x{1F1F4}\x{1F1F7}-\x{1F1F9}\x{1F1FB}\x{1F1FD}-\x{1F1FF}]?|\x{1F1E8}[\x{1F1E6}\x{1F1E8}\x{1F1E9}\x{1F1EB}-\x{1F1EE}\x{1F1F0}-\x{1F1F5}\x{1F1F7}\x{1F1FA}-\x{1F1FF}]?|\x{1F1EC}[\x{1F1E6}\x{1F1E7}\x{1F1E9}-\x{1F1EE}\x{1F1F1}-\x{1F1F3}\x{1F1F5}-\x{1F1FA}\x{1F1FC}\x{1F1FE}]?|\x{1F1E6}[\x{1F1E8}-\x{1F1EC}\x{1F1EE}\x{1F1F1}\x{1F1F2}\x{1F1F4}\x{1F1F6}-\x{1F1FA}\x{1F1FC}\x{1F1FD}\x{1F1FF}]?|\x{1F1F9}[\x{1F1E6}\x{1F1E8}\x{1F1E9}\x{1F1EB}-\x{1F1ED}\x{1F1EF}-\x{1F1F4}\x{1F1F7}\x{1F1F9}\x{1F1FB}\x{1F1FC}\x{1F1FF}]?|\x{1F1F5}[\x{1F1E6}\x{1F1EA}-\x{1F1ED}\x{1F1F0}-\x{1F1F3}\x{1F1F7}-\x{1F1F9}\x{1F1FC}\x{1F1FE}]?|\x{1F1F3}[\x{1F1E6}\x{1F1E8
}\x{1F1EA}-\x{1F1EC}\x{1F1EE}\x{1F1F1}\x{1F1F4}\x{1F1F5}\x{1F1F7}\x{1F1FA}\x{1F1FF}]?|\x{1F1EE}[\x{1F1E8}-\x{1F1EA}\x{1F1F1}-\x{1F1F4}\x{1F1F6}-\x{1F1F9}]?|\x{1F1F0}[\x{1F1EA}\x{1F1EC}-\x{1F1EE}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F7}\x{1F1FC}\x{1F1FE}\x{1F1FF}]?|\x{1F1F1}[\x{1F1E6}-\x{1F1E8}\x{1F1EE}\x{1F1F0}\x{1F1F7}-\x{1F1FB}\x{1F1FE}]?|\x{1F1EA}[\x{1F1E6}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1ED}\x{1F1F7}-\x{1F1FA}]?|\x{26F9}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F3C4}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F3CA}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F3CB}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F3CC}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F575}[\x{200D}\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{261D}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{270C}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{270D}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F1E9}[\x{1F1EA}\x{1F1EC}\x{1F1EF}\x{1F1F0}\x{1F1F2}\x{1F1F4}\x{1F1FF}]?|\x{1F1FA}[\x{1F1E6}\x{1F1EC}\x{1F1F2}\x{1F1F3}\x{1F1F8}\x{1F1FE}\x{1F1FF}]?|\x{1F1FB}[\x{1F1E6}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1EE}\x{1F1F3}\x{1F1FA}]?|\x{1F3C2}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F442}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F446}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F447}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F448}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F449}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F44D}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F44E}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F574}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F590}[\x{FE0E}\x{FE0F}\x{1F3FB}-\x{1F3FF}]?|\x{1F1EB}[\x{1F1EE}-\x{1F1F0}\x{1F1F2}\x{1F1F4}\x{1F1F7}]?|\x{1F1ED}[\x{1F1F0}\x{1F1F2}\x{1F1F3}\x{1F1F7}\x{1F1F9}\x{1F1FA}]?|\x{1F3C3}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F468}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F469}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F46E}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F471}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F473}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F477}[\x{200D}\x{1F3FB}-\x{1F3F
F}]?|\x{1F481}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F482}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F486}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F487}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F645}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F646}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F647}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F64B}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F64D}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F64E}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F6A3}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F6B4}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F6B5}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F6B6}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F926}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F937}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F938}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F939}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F93D}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F93E}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9B8}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9B9}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9CD}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9CE}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9CF}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D1}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D6}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D7}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D8}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9D9}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9DA}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9DB}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9DC}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{1F9DD}[\x{200D}\x{1F3FB}-\x{1F3FF}]?|\x{270A}[\x{1F3FB}-\x{1F3FF}]?|\x{270B}[\x{1F3FB}-\x{1F3FF}]?|\x{1F1F7}[\x{1F1EA}\x{1F1F4}\x{1F1F8}\x{1F1FA}\x{1F1FC}]?|\x{1F385}[\x{1F3FB}-\x{1F3FF}]?|\x{1F3C7}[\x{1F3FB}-\x{1F3FF}]?|\x{1F443}[\x{1F3FB}-\x{1F3FF}]?|\x{1F44A}[\x{1F3FB}-\x{1F3FF}]?|\x{1F44B}[\x{1F3FB}-\x{1F3FF}]?|\x{1F44C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F44F}[\x{1F3FB}-\x{1F3FF}]?|\x{1F450}[\x{1F3FB}-\x{1F3FF}]?|\x{1F466}[\x{1F3FB}-\x{1F3FF}]?|\x{1F467}[\x{1F3FB}-\x{1F3FF}]?|\x{1F46B}[\x{1F3FB}-\x{1F3FF}]?|\x{1F46C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F46D}[\x{1F3FB}-\x{1F3FF}]?|\x{1F470}[\x{1F3FB}-\x{1F3FF}]?|\x{1F472}[\x{1F3FB}-\x{1F3FF}]?|\x{1F474}[\x{1F3FB}-\x{1F3FF}]?|\x{
1F475}[\x{1F3FB}-\x{1F3FF}]?|\x{1F476}[\x{1F3FB}-\x{1F3FF}]?|\x{1F478}[\x{1F3FB}-\x{1F3FF}]?|\x{1F47C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F483}[\x{1F3FB}-\x{1F3FF}]?|\x{1F485}[\x{1F3FB}-\x{1F3FF}]?|\x{1F4AA}[\x{1F3FB}-\x{1F3FF}]?|\x{1F595}[\x{1F3FB}-\x{1F3FF}]?|\x{1F596}[\x{1F3FB}-\x{1F3FF}]?|\x{1F64C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F64F}[\x{1F3FB}-\x{1F3FF}]?|\x{1F6C0}[\x{1F3FB}-\x{1F3FF}]?|\x{1F6CC}[\x{1F3FB}-\x{1F3FF}]?|\x{1F90F}[\x{1F3FB}-\x{1F3FF}]?|\x{1F918}[\x{1F3FB}-\x{1F3FF}]?|\x{1F919}[\x{1F3FB}-\x{1F3FF}]?|\x{1F91A}[\x{1F3FB}-\x{1F3FF}]?|\x{1F91B}[\x{1F3FB}-\x{1F3FF}]?|\x{1F91C}[\x{1F3FB}-\x{1F3FF}]?|\x{1F91E}[\x{1F3FB}-\x{1F3FF}]?|\x{1F931}[\x{1F3FB}-\x{1F3FF}]?|\x{1F932}[\x{1F3FB}-\x{1F3FF}]?|\x{1F933}[\x{1F3FB}-\x{1F3FF}]?|\x{1F934}[\x{1F3FB}-\x{1F3FF}]?|\x{1F935}[\x{1F3FB}-\x{1F3FF}]?|\x{1F936}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9B5}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9B6}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9BB}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9D2}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9D3}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9D4}[\x{1F3FB}-\x{1F3FF}]?|\x{1F9D5}[\x{1F3FB}-\x{1F3FF}]?|\x{1F1EF}[\x{1F1EA}\x{1F1F2}\x{1F1F4}\x{1F1F5}]?|\x{1F57A}[\x{1F3FB}-\x{1F3FF}]|\x{1F91F}[\x{1F3FB}-\x{1F3FF}]|\x{1F930}[\x{1F3FB}-\x{1F3FF}]|0[\x{20E3}\x{FE0E}\x{FE0F}]?|1[\x{20E3}\x{FE0E}\x{FE0F}]?|2[\x{20E3}\x{FE0E}\x{FE0F}]?|3[\x{20E3}\x{FE0E}\x{FE0F}]?|4[\x{20E3}\x{FE0E}\x{FE0F}]?|5[\x{20E3}\x{FE0E}\x{FE0F}]?|6[\x{20E3}\x{FE0E}\x{FE0F}]?|7[\x{20E3}\x{FE0E}\x{FE0F}]?|8[\x{20E3}\x{FE0E}\x{FE0F}]?|9[\x{20E3}\x{FE0E}\x{FE0F}]?|\\*[\x{20E3}\x{FE0E}\x{FE0F}]|\x{1F1FF}[\x{1F1E6}\x{1F1F2}\x{1F1FC}]?|\x{1F3F3}[\x{200D}\x{FE0E}\x{FE0F}]?|\x{1F415}[\x{200D}\x{FE0E}\x{FE0F}]?|#[\x{20E3}\x{FE0E}\x{FE0F}]|\x{2194}[\x{FE0E}\x{FE0F}]?|\x{2195}[\x{FE0E}\x{FE0F}]?|\x{2196}[\x{FE0E}\x{FE0F}]?|\x{2197}[\x{FE0E}\x{FE0F}]?|\x{2198}[\x{FE0E}\x{FE0F}]?|\x{2199}[\x{FE0E}\x{FE0F}]?|\x{21A9}[\x{FE0E}\x{FE0F}]?|\x{21AA}[\x{FE0E}\x{FE0F}]?|\x{231A}[\x{FE0E}\x{FE0F}]?|\x{231B}[\x{FE0E}\x{FE0F}]?|\x{23E9}[\x{FE0E}\x{FE0F}]?|\x{23EA}[\x{FE0E}\x{FE0F}]?|\x{23ED
}[\x{FE0E}\x{FE0F}]?|\x{23EE}[\x{FE0E}\x{FE0F}]?|\x{23EF}[\x{FE0E}\x{FE0F}]?|\x{23F1}[\x{FE0E}\x{FE0F}]?|\x{23F2}[\x{FE0E}\x{FE0F}]?|\x{23F3}[\x{FE0E}\x{FE0F}]?|\x{23F8}[\x{FE0E}\x{FE0F}]?|\x{23F9}[\x{FE0E}\x{FE0F}]?|\x{23FA}[\x{FE0E}\x{FE0F}]?|\x{25AA}[\x{FE0E}\x{FE0F}]?|\x{25AB}[\x{FE0E}\x{FE0F}]?|\x{25FB}[\x{FE0E}\x{FE0F}]?|\x{25FC}[\x{FE0E}\x{FE0F}]?|\x{25FD}[\x{FE0E}\x{FE0F}]?|\x{25FE}[\x{FE0E}\x{FE0F}]?|\x{2600}[\x{FE0E}\x{FE0F}]?|\x{2601}[\x{FE0E}\x{FE0F}]?|\x{2602}[\x{FE0E}\x{FE0F}]?|\x{2603}[\x{FE0E}\x{FE0F}]?|\x{2604}[\x{FE0E}\x{FE0F}]?|\x{260E}[\x{FE0E}\x{FE0F}]?|\x{2611}[\x{FE0E}\x{FE0F}]?|\x{2614}[\x{FE0E}\x{FE0F}]?|\x{2615}[\x{FE0E}\x{FE0F}]?|\x{2620}[\x{FE0E}\x{FE0F}]?|\x{2622}[\x{FE0E}\x{FE0F}]?|\x{2623}[\x{FE0E}\x{FE0F}]?|\x{2626}[\x{FE0E}\x{FE0F}]?|\x{262A}[\x{FE0E}\x{FE0F}]?|\x{262E}[\x{FE0E}\x{FE0F}]?|\x{262F}[\x{FE0E}\x{FE0F}]?|\x{2638}[\x{FE0E}\x{FE0F}]?|\x{2639}[\x{FE0E}\x{FE0F}]?|\x{263A}[\x{FE0E}\x{FE0F}]?|\x{2640}[\x{FE0E}\x{FE0F}]?|\x{2642}[\x{FE0E}\x{FE0F}]?|\x{2648}[\x{FE0E}\x{FE0F}]?|\x{2649}[\x{FE0E}\x{FE0F}]?|\x{264A}[\x{FE0E}\x{FE0F}]?|\x{264B}[\x{FE0E}\x{FE0F}]?|\x{264C}[\x{FE0E}\x{FE0F}]?|\x{264D}[\x{FE0E}\x{FE0F}]?|\x{264E}[\x{FE0E}\x{FE0F}]?|\x{264F}[\x{FE0E}\x{FE0F}]?|\x{2650}[\x{FE0E}\x{FE0F}]?|\x{2651}[\x{FE0E}\x{FE0F}]?|\x{2652}[\x{FE0E}\x{FE0F}]?|\x{2653}[\x{FE0E}\x{FE0F}]?|\x{265F}[\x{FE0E}\x{FE0F}]?|\x{2660}[\x{FE0E}\x{FE0F}]?|\x{2663}[\x{FE0E}\x{FE0F}]?|\x{2665}[\x{FE0E}\x{FE0F}]?|\x{2666}[\x{FE0E}\x{FE0F}]?|\x{2668}[\x{FE0E}\x{FE0F}]?|\x{267B}[\x{FE0E}\x{FE0F}]?|\x{267E}[\x{FE0E}\x{FE0F}]?|\x{267F}[\x{FE0E}\x{FE0F}]?|\x{2692}[\x{FE0E}\x{FE0F}]?|\x{2693}[\x{FE0E}\x{FE0F}]?|\x{2694}[\x{FE0E}\x{FE0F}]?|\x{2695}[\x{FE0E}\x{FE0F}]?|\x{2696}[\x{FE0E}\x{FE0F}]?|\x{2697}[\x{FE0E}\x{FE0F}]?|\x{2699}[\x{FE0E}\x{FE0F}]?|\x{269B}[\x{FE0E}\x{FE0F}]?|\x{269C}[\x{FE0E}\x{FE0F}]?|\x{26A0}[\x{FE0E}\x{FE0F}]?|\x{26A1}[\x{FE0E}\x{FE0F}]?|\x{26AA}[\x{FE0E}\x{FE0F}]?|\x{26AB}[\x{FE0E}\x{FE0F}]?|\x{26B0}[\x{FE0E}\x{FE0F}]?|\x{26B1}[\x{FE0E}\x
{FE0F}]?|\x{26BD}[\x{FE0E}\x{FE0F}]?|\x{26BE}[\x{FE0E}\x{FE0F}]?|\x{26C4}[\x{FE0E}\x{FE0F}]?|\x{26C5}[\x{FE0E}\x{FE0F}]?|\x{26C8}[\x{FE0E}\x{FE0F}]?|\x{26CF}[\x{FE0E}\x{FE0F}]?|\x{26D1}[\x{FE0E}\x{FE0F}]?|\x{26D3}[\x{FE0E}\x{FE0F}]?|\x{26D4}[\x{FE0E}\x{FE0F}]?|\x{26E9}[\x{FE0E}\x{FE0F}]?|\x{26EA}[\x{FE0E}\x{FE0F}]?|\x{26F0}[\x{FE0E}\x{FE0F}]?|\x{26F1}[\x{FE0E}\x{FE0F}]?|\x{26F2}[\x{FE0E}\x{FE0F}]?|\x{26F3}[\x{FE0E}\x{FE0F}]?|\x{26F4}[\x{FE0E}\x{FE0F}]?|\x{26F5}[\x{FE0E}\x{FE0F}]?|\x{26F7}[\x{FE0E}\x{FE0F}]?|\x{26F8}[\x{FE0E}\x{FE0F}]?|\x{26FA}[\x{FE0E}\x{FE0F}]?|\x{26FD}[\x{FE0E}\x{FE0F}]?|\x{2702}[\x{FE0E}\x{FE0F}]?|\x{2708}[\x{FE0E}\x{FE0F}]?|\x{2709}[\x{FE0E}\x{FE0F}]?|\x{270F}[\x{FE0E}\x{FE0F}]?|\x{2712}[\x{FE0E}\x{FE0F}]?|\x{2733}[\x{FE0E}\x{FE0F}]?|\x{2734}[\x{FE0E}\x{FE0F}]?|\x{2753}[\x{FE0E}\x{FE0F}]?|\x{2763}[\x{FE0E}\x{FE0F}]?|\x{2764}[\x{FE0E}\x{FE0F}]?|\x{2934}[\x{FE0E}\x{FE0F}]?|\x{2935}[\x{FE0E}\x{FE0F}]?|\x{2B05}[\x{FE0E}\x{FE0F}]?|\x{2B06}[\x{FE0E}\x{FE0F}]?|\x{2B07}[\x{FE0E}\x{FE0F}]?|\x{2B1B}[\x{FE0E}\x{FE0F}]?|\x{2B1C}[\x{FE0E}\x{FE0F}]?|\x{1F004}[\x{FE0E}\x{FE0F}]?|\x{1F170}[\x{FE0E}\x{FE0F}]?|\x{1F171}[\x{FE0E}\x{FE0F}]?|\x{1F1FC}[\x{1F1EB}\x{1F1F8}]?|\x{1F1FE}[\x{1F1EA}\x{1F1F9}]?|\x{1F202}[\x{FE0E}\x{FE0F}]?|\x{1F237}[\x{FE0E}\x{FE0F}]?|\x{1F30D}[\x{FE0E}\x{FE0F}]?|\x{1F30E}[\x{FE0E}\x{FE0F}]?|\x{1F30F}[\x{FE0E}\x{FE0F}]?|\x{1F315}[\x{FE0E}\x{FE0F}]?|\x{1F31C}[\x{FE0E}\x{FE0F}]?|\x{1F321}[\x{FE0E}\x{FE0F}]?|\x{1F324}[\x{FE0E}\x{FE0F}]?|\x{1F325}[\x{FE0E}\x{FE0F}]?|\x{1F326}[\x{FE0E}\x{FE0F}]?|\x{1F327}[\x{FE0E}\x{FE0F}]?|\x{1F328}[\x{FE0E}\x{FE0F}]?|\x{1F329}[\x{FE0E}\x{FE0F}]?|\x{1F32A}[\x{FE0E}\x{FE0F}]?|\x{1F32B}[\x{FE0E}\x{FE0F}]?|\x{1F32C}[\x{FE0E}\x{FE0F}]?|\x{1F378}[\x{FE0E}\x{FE0F}]?|\x{1F393}[\x{FE0E}\x{FE0F}]?|\x{1F396}[\x{FE0E}\x{FE0F}]?|\x{1F397}[\x{FE0E}\x{FE0F}]?|\x{1F399}[\x{FE0E}\x{FE0F}]?|\x{1F39A}[\x{FE0E}\x{FE0F}]?|\x{1F39B}[\x{FE0E}\x{FE0F}]?|\x{1F39E}[\x{FE0E}\x{FE0F}]?|\x{1F39F}[\x{FE0E}\x{FE0F}]?|\x{1F3A7}[\x{FE0E}\x{FE0
F}]?|\x{1F3AC}[\x{FE0E}\x{FE0F}]?|\x{1F3AD}[\x{FE0E}\x{FE0F}]?|\x{1F3AE}[\x{FE0E}\x{FE0F}]?|\x{1F3C6}[\x{FE0E}\x{FE0F}]?|\x{1F3CD}[\x{FE0E}\x{FE0F}]?|\x{1F3CE}[\x{FE0E}\x{FE0F}]?|\x{1F3D4}[\x{FE0E}\x{FE0F}]?|\x{1F3D5}[\x{FE0E}\x{FE0F}]?|\x{1F3D6}[\x{FE0E}\x{FE0F}]?|\x{1F3D7}[\x{FE0E}\x{FE0F}]?|\x{1F3D8}[\x{FE0E}\x{FE0F}]?|\x{1F3D9}[\x{FE0E}\x{FE0F}]?|\x{1F3DA}[\x{FE0E}\x{FE0F}]?|\x{1F3DB}[\x{FE0E}\x{FE0F}]?|\x{1F3DC}[\x{FE0E}\x{FE0F}]?|\x{1F3DD}[\x{FE0E}\x{FE0F}]?|\x{1F3DE}[\x{FE0E}\x{FE0F}]?|\x{1F3DF}[\x{FE0E}\x{FE0F}]?|\x{1F3E0}[\x{FE0E}\x{FE0F}]?|\x{1F3ED}[\x{FE0E}\x{FE0F}]?|\x{1F3F4}[\x{200D}\x{E0067}]?|\x{1F3F5}[\x{FE0E}\x{FE0F}]?|\x{1F3F7}[\x{FE0E}\x{FE0F}]?|\x{1F408}[\x{FE0E}\x{FE0F}]?|\x{1F41F}[\x{FE0E}\x{FE0F}]?|\x{1F426}[\x{FE0E}\x{FE0F}]?|\x{1F441}[\x{200D}\x{FE0E}\x{FE0F}]|\x{1F453}[\x{FE0E}\x{FE0F}]?|\x{1F46A}[\x{FE0E}\x{FE0F}]?|\x{1F47D}[\x{FE0E}\x{FE0F}]?|\x{1F4A3}[\x{FE0E}\x{FE0F}]?|\x{1F4B0}[\x{FE0E}\x{FE0F}]?|\x{1F4B3}[\x{FE0E}\x{FE0F}]?|\x{1F4BB}[\x{FE0E}\x{FE0F}]?|\x{1F4BF}[\x{FE0E}\x{FE0F}]?|\x{1F4CB}[\x{FE0E}\x{FE0F}]?|\x{1F4DA}[\x{FE0E}\x{FE0F}]?|\x{1F4DF}[\x{FE0E}\x{FE0F}]?|\x{1F4E4}[\x{FE0E}\x{FE0F}]?|\x{1F4E5}[\x{FE0E}\x{FE0F}]?|\x{1F4E6}[\x{FE0E}\x{FE0F}]?|\x{1F4EA}[\x{FE0E}\x{FE0F}]?|\x{1F4EB}[\x{FE0E}\x{FE0F}]?|\x{1F4EC}[\x{FE0E}\x{FE0F}]?|\x{1F4ED}[\x{FE0E}\x{FE0F}]?|\x{1F4F7}[\x{FE0E}\x{FE0F}]?|\x{1F4F9}[\x{FE0E}\x{FE0F}]?|\x{1F4FA}[\x{FE0E}\x{FE0F}]?|\x{1F4FB}[\x{FE0E}\x{FE0F}]?|\x{1F4FD}[\x{FE0E}\x{FE0F}]?|\x{1F508}[\x{FE0E}\x{FE0F}]?|\x{1F50D}[\x{FE0E}\x{FE0F}]?|\x{1F512}[\x{FE0E}\x{FE0F}]?|\x{1F513}[\x{FE0E}\x{FE0F}]?|\x{1F549}[\x{FE0E}\x{FE0F}]?|\x{1F54A}[\x{FE0E}\x{FE0F}]?|\x{1F550}[\x{FE0E}\x{FE0F}]?|\x{1F551}[\x{FE0E}\x{FE0F}]?|\x{1F552}[\x{FE0E}\x{FE0F}]?|\x{1F553}[\x{FE0E}\x{FE0F}]?|\x{1F554}[\x{FE0E}\x{FE0F}]?|\x{1F555}[\x{FE0E}\x{FE0F}]?|\x{1F556}[\x{FE0E}\x{FE0F}]?|\x{1F557}[\x{FE0E}\x{FE0F}]?|\x{1F558}[\x{FE0E}\x{FE0F}]?|\x{1F559}[\x{FE0E}\x{FE0F}]?|\x{1F55A}[\x{FE0E}\x{FE0F}]?|\x{1F55B}[\x{FE0E}\x{FE0F}]?|\x{1F55C}[\x{FE
0E}\x{FE0F}]?|\x{1F55D}[\x{FE0E}\x{FE0F}]?|\x{1F55E}[\x{FE0E}\x{FE0F}]?|\x{1F55F}[\x{FE0E}\x{FE0F}]?|\x{1F560}[\x{FE0E}\x{FE0F}]?|\x{1F561}[\x{FE0E}\x{FE0F}]?|\x{1F562}[\x{FE0E}\x{FE0F}]?|\x{1F563}[\x{FE0E}\x{FE0F}]?|\x{1F564}[\x{FE0E}\x{FE0F}]?|\x{1F565}[\x{FE0E}\x{FE0F}]?|\x{1F566}[\x{FE0E}\x{FE0F}]?|\x{1F567}[\x{FE0E}\x{FE0F}]?|\x{1F56F}[\x{FE0E}\x{FE0F}]?|\x{1F570}[\x{FE0E}\x{FE0F}]?|\x{1F573}[\x{FE0E}\x{FE0F}]?|\x{1F576}[\x{FE0E}\x{FE0F}]?|\x{1F577}[\x{FE0E}\x{FE0F}]?|\x{1F578}[\x{FE0E}\x{FE0F}]?|\x{1F579}[\x{FE0E}\x{FE0F}]?|\x{1F587}[\x{FE0E}\x{FE0F}]?|\x{1F58A}[\x{FE0E}\x{FE0F}]?|\x{1F58B}[\x{FE0E}\x{FE0F}]?|\x{1F58C}[\x{FE0E}\x{FE0F}]?|\x{1F58D}[\x{FE0E}\x{FE0F}]?|\x{1F5A5}[\x{FE0E}\x{FE0F}]?|\x{1F5A8}[\x{FE0E}\x{FE0F}]?|\x{1F5B1}[\x{FE0E}\x{FE0F}]?|\x{1F5B2}[\x{FE0E}\x{FE0F}]?|\x{1F5BC}[\x{FE0E}\x{FE0F}]?|\x{1F5C2}[\x{FE0E}\x{FE0F}]?|\x{1F5C3}[\x{FE0E}\x{FE0F}]?|\x{1F5C4}[\x{FE0E}\x{FE0F}]?|\x{1F5D1}[\x{FE0E}\x{FE0F}]?|\x{1F5D2}[\x{FE0E}\x{FE0F}]?|\x{1F5D3}[\x{FE0E}\x{FE0F}]?|\x{1F5DC}[\x{FE0E}\x{FE0F}]?|\x{1F5DD}[\x{FE0E}\x{FE0F}]?|\x{1F5DE}[\x{FE0E}\x{FE0F}]?|\x{1F5E1}[\x{FE0E}\x{FE0F}]?|\x{1F5E3}[\x{FE0E}\x{FE0F}]?|\x{1F5E8}[\x{FE0E}\x{FE0F}]?|\x{1F5EF}[\x{FE0E}\x{FE0F}]?|\x{1F5F3}[\x{FE0E}\x{FE0F}]?|\x{1F5FA}[\x{FE0E}\x{FE0F}]?|\x{1F610}[\x{FE0E}\x{FE0F}]?|\x{1F687}[\x{FE0E}\x{FE0F}]?|\x{1F68D}[\x{FE0E}\x{FE0F}]?|\x{1F691}[\x{FE0E}\x{FE0F}]?|\x{1F694}[\x{FE0E}\x{FE0F}]?|\x{1F698}[\x{FE0E}\x{FE0F}]?|\x{1F6AD}[\x{FE0E}\x{FE0F}]?|\x{1F6B2}[\x{FE0E}\x{FE0F}]?|\x{1F6B9}[\x{FE0E}\x{FE0F}]?|\x{1F6BA}[\x{FE0E}\x{FE0F}]?|\x{1F6BC}[\x{FE0E}\x{FE0F}]?|\x{1F6CB}[\x{FE0E}\x{FE0F}]?|\x{1F6CD}[\x{FE0E}\x{FE0F}]?|\x{1F6CE}[\x{FE0E}\x{FE0F}]?|\x{1F6CF}[\x{FE0E}\x{FE0F}]?|\x{1F6E0}[\x{FE0E}\x{FE0F}]?|\x{1F6E1}[\x{FE0E}\x{FE0F}]?|\x{1F6E2}[\x{FE0E}\x{FE0F}]?|\x{1F6E3}[\x{FE0E}\x{FE0F}]?|\x{1F6E4}[\x{FE0E}\x{FE0F}]?|\x{1F6E5}[\x{FE0E}\x{FE0F}]?|\x{1F6E9}[\x{FE0E}\x{FE0F}]?|\x{1F6F0}[\x{FE0E}\x{FE0F}]?|\x{1F6F3}[\x{FE0E}\x{FE0F}]?|\xA9[\x{FE0E}\x{FE0F}]|\xAE[\x{FE0E}\x{FE0F
}]|\x{203C}[\x{FE0E}\x{FE0F}]|\x{2049}[\x{FE0E}\x{FE0F}]|\x{2122}[\x{FE0E}\x{FE0F}]|\x{2139}[\x{FE0E}\x{FE0F}]|\x{2328}[\x{FE0E}\x{FE0F}]|\x{23CF}[\x{FE0E}\x{FE0F}]|\x{24C2}[\x{FE0E}\x{FE0F}]|\x{25B6}[\x{FE0E}\x{FE0F}]|\x{25C0}[\x{FE0E}\x{FE0F}]|\x{2618}[\x{FE0E}\x{FE0F}]|\x{2714}[\x{FE0E}\x{FE0F}]|\x{2716}[\x{FE0E}\x{FE0F}]|\x{271D}[\x{FE0E}\x{FE0F}]|\x{2721}[\x{FE0E}\x{FE0F}]|\x{2744}[\x{FE0E}\x{FE0F}]|\x{2747}[\x{FE0E}\x{FE0F}]|\x{2757}[\x{FE0E}\x{FE0F}]|\x{27A1}[\x{FE0E}\x{FE0F}]|\x{2B50}[\x{FE0E}\x{FE0F}]|\x{2B55}[\x{FE0E}\x{FE0F}]|\x{3030}[\x{FE0E}\x{FE0F}]|\x{303D}[\x{FE0E}\x{FE0F}]|\x{3297}[\x{FE0E}\x{FE0F}]|\x{3299}[\x{FE0E}\x{FE0F}]|\x{1F17E}[\x{FE0E}\x{FE0F}]|\x{1F17F}[\x{FE0E}\x{FE0F}]|\x{1F21A}[\x{FE0E}\x{FE0F}]|\x{1F22F}[\x{FE0E}\x{FE0F}]|\x{1F336}[\x{FE0E}\x{FE0F}]|\x{1F37D}[\x{FE0E}\x{FE0F}]|\x{1F43F}[\x{FE0E}\x{FE0F}]|\x{1F1F4}\x{1F1F2}?|\x{1F1F6}\x{1F1E6}?|\x{1F1FD}\x{1F1F0}?|\x{1F46F}\x{200D}?|\x{1F93C}\x{200D}?|\x{1F9DE}\x{200D}?|\x{1F9DF}\x{200D}?)
daxim
  • It's a good idea to parse all emoji-*.txt files. (At least we can be sure that they cover all possible emojis!) But this regex is too long and it may be slow on long text. – HosSeinM Nov 05 '19 at 20:35
  • I can't see how it could be possibly slow. Length of expression does not kill performance, exponential backtracking does. But this one is just a bunch of alternations with maximum 1 step of backtracking. Have you benchmarked it? – Strategies for shrinking the visible length of the code (makes no difference at run-time): 1. Replace seven most common suffixes with variables (45% reduction) 2. Replace escapes with literal characters (30% reduction) – daxim Nov 06 '19 at 13:09
0

This is the equivalent version in PHP:

preg_replace("/\u{00a9}|\u{00ae}|[\u{2000}-\u{3300}]|[\u{1e400}-\u{1f3ff}]|[\u{1e800}-\u{1f7ff}]|[\u{1ec00}-\u{1fbff}]/u",'', $value);

To create it, I converted the surrogate ranges to code points, thanks to: How to convert between a Unicode/UCS codepoint and a UTF16 surrogate pair?

// PHP equivalent
function combine($surrogateHigh, $surrogateLow){
    return (($surrogateHigh - 0xd800) * 0x400) + ($surrogateLow - 0xdc00) + 0x10000;
}

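Then I converted the ranges. Note that 0xd000 is not a valid low surrogate (those run from 0xdc00 to 0xdfff), so the lower bounds printed here start 0xc00 code points below the first real pair for each high surrogate: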

echo dechex(combine(0xd83c, 0xd000)). "\n";
echo dechex(combine(0xd83c, 0xdfff)). "\n";

echo dechex(combine(0xd83d, 0xd000)). "\n";
echo dechex(combine(0xd83d, 0xdfff)). "\n";

echo dechex(combine(0xd83e, 0xd000)). "\n";
echo dechex(combine(0xd83e, 0xdfff)). "\n";
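
As a cross-check, the same arithmetic can be run with 0xdc00 as the lowest low surrogate. This is only a sketch: the function name surrogatePairToCodePoint is illustrative and not part of the original answer, and it assumes well-formed UTF-16 pairs.

```php
<?php
// Combine a UTF-16 surrogate pair into a single Unicode code point.
// (Illustrative helper name; same formula as combine() above.)
function surrogatePairToCodePoint(int $high, int $low): int
{
    return (($high - 0xD800) << 10) + ($low - 0xDC00) + 0x10000;
}

// Valid low surrogates span 0xDC00..0xDFFF, so each high surrogate
// covers a block of exactly 0x400 code points.
$ranges = [];
foreach ([0xD83C, 0xD83D, 0xD83E] as $high) {
    $ranges[] = sprintf(
        '%x-%x',
        surrogatePairToCodePoint($high, 0xDC00),
        surrogatePairToCodePoint($high, 0xDFFF)
    );
}

echo implode("\n", $ranges), "\n";
// 1f000-1f3ff
// 1f400-1f7ff
// 1f800-1fbff
```

Under that assumption the three blocks are contiguous, so a single class such as [\x{1F000}-\x{1FBFF}] would cover the pairs that the JavaScript pattern can actually match in well-formed text.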

0

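As far as PHP goes, you can json_encode() the string you want to apply the "illegal" regex pattern to. By default this escapes every non-ASCII character as an ASCII \uXXXX sequence, and characters outside the Basic Multilingual Plane become two such escapes (a surrogate pair).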

From there you can just check for the literal unicode string:

$value = "Sup 🚫"; // U+1F6AB
$res = json_decode(preg_replace('/\\\ud83d\\\udeab/i', 'REPLACED', json_encode($value)));
// $res is now "Sup REPLACED"; json_encode() writes this one emoji as two \u escapes (a surrogate pair)

Note: I wrapped it in a json_decode() to get the original string back.
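
To make the intermediate step visible, here is a small sketch that prints the encoded form before matching. It assumes the sample emoji is U+1F6AB (the code point whose surrogate pair is d83d/deab) and default json_encode() flags; the variable names are illustrative.

```php
<?php
// U+1F6AB written with PHP 7+ code point escape syntax.
$value = "Sup \u{1F6AB}";

// With default flags json_encode() escapes non-ASCII characters as \uXXXX;
// a code point above U+FFFF becomes two escapes (a surrogate pair).
$encoded = json_encode($value);
echo $encoded, "\n"; // "Sup \ud83d\udeab"

// The escapes are plain ASCII text now, so an ordinary pattern can match them.
// In a single-quoted PHP string, \\\\ yields the regex \\ (one literal backslash).
$res = json_decode(preg_replace('/\\\\ud83d\\\\udeab/i', 'REPLACED', $encoded));
echo $res, "\n"; // Sup REPLACED
```

The pattern here works on the ASCII escape text, not on the emoji itself, which is why no /u modifier is needed.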

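Also, >= 0xd800 && <= 0xdfff is the range of surrogate code points. They are reserved for UTF-16 surrogate pairs and are not valid characters on their own, so PCRE in UTF mode rejects any pattern that names one. Both halves of the pair for the emoji in my example above (0xd83d and 0xdeab) fall in that range.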
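
For comparison, a minimal sketch of matching the same character by its real code point instead of its surrogate halves. It again assumes the sample emoji is U+1F6AB; the @ operator is used only to keep the expected warning out of the output.

```php
<?php
$value = "Sup \u{1F6AB}"; // U+1F6AB, outside the Basic Multilingual Plane

// With the /u modifier PCRE works on code points, so the character can be
// addressed directly and no surrogate values are needed.
$res = preg_replace('/\x{1F6AB}/u', 'REPLACED', $value);
echo $res, "\n"; // Sup REPLACED

// Naming a surrogate value fails at compile time: preg_replace() raises
// the "disallowed Unicode code point" warning and returns null.
$bad = @preg_replace('/\x{d83d}/u', '', $value);
var_dump($bad); // NULL
```

Unlike the JSON approach, this form also accepts ranges such as [\x{1F600}-\x{1F64F}].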

Downside: you can't use hex ranges with this solution. You have to know exactly which emojis are problematic and handle each one explicitly (i.e. '/' . implode('|', EmojiClass::BAD_EMOJI_HEXES_ARRAY) . '/i')