I converted a regular expression from JavaScript to PHP. It is taken from https://twemoji.maxcdn.com/v/latest/twemoji.js and matches the Unicode characters related to emoji.

The converted regex works as intended when I test it on regex101.com.

However, when I test it in my local environment, it does not work.

You can see a working example here: https://regex101.com/r/IuIhBF/1

Here is the PHP version. http://sandbox.onlinephpfunctions.com/code/3bd5933f5230fc1c45104b7eccd9379b68870016

I tried changing the preg_match_all flags and adding the u modifier to the regular expression, e.g. /*****/u.

I can't get it to work.

It would be great if somebody could help me solve this error: `Compilation failed: range out of order in character class at offset 306`.

  • Interestingly the error on PHP 5.x for that code is more informative perhaps: `Compilation failed: character value in \x{} or \o{} is too large at offset 10`. – ficuscr Jul 23 '19 at 22:14
  • Missing the `u` (PCRE_UTF8) modifier. Think I clobbered your PHP paste-bin example while testing that. – ficuscr Jul 23 '19 at 22:19
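
To make the point of the two comments above concrete, here is a minimal sketch of how the `u` modifier changes compilation. It assumes PHP 7 or later (for the `"\u{...}"` string escape) and uses a single illustrative range rather than the full twemoji expression; the variable names are made up for the example.

```php
<?php
// Illustrative one-range pattern, not the full twemoji expression.
$range = '[\x{1f300}-\x{1f5ff}]';
$subject = "pizza: \u{1F355}"; // U+1F355 lies inside the range above

// Without the u modifier the pattern is compiled in byte mode, where
// \x{1f300} is too large a value, so compilation fails and preg_match()
// returns false (the warning is suppressed with @ for this demo).
$withoutU = @preg_match('/' . $range . '/', $subject);

// With the u modifier, pattern and subject are treated as UTF-8,
// so code points above 0xFF are allowed in \x{...}.
$withU = preg_match('/' . $range . '/u', $subject, $m);

var_dump($withoutU); // bool(false)
var_dump($withU);    // int(1)
```

The exact wording of the compile warning differs between PCRE versions, which is probably why PHP 5.x and 7.x report different messages for the same paste.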

2 Answers

This expression seems to work on your samples when the u flag is set:

$re = '/[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]/u';
$str = 'Time in emoji is very expressive.  allowed us to communicate time very easily.

Next up was negation. ❌️ means “No talk.”';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

The expression is explained in the top-right panel of regex101.com if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs, if you like.
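
As a rough sketch of how the result of `preg_match_all` with `PREG_SET_ORDER` can be read, the snippet below uses a shortened three-range character class and a made-up sample string (both are assumptions for illustration, not the full expression or the original text). It requires PHP 7 or later for the `"\u{...}"` escapes.

```php
<?php
// Shortened, illustrative subset of the ranges from the expression above.
$re = '/[\x{1f300}-\x{1f5ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}]/u';

// Made-up sample: U+274C (cross mark) and U+1F552 (clock face, three o'clock).
$str = "Next up was negation. \u{274C} means no talk. \u{1F552}";

// With PREG_SET_ORDER, $matches holds one array per match,
// and element 0 of each array is the full matched text.
$count = preg_match_all($re, $str, $matches, PREG_SET_ORDER);

// Collect just the matched characters.
$found = array_column($matches, 0);

var_dump($count);        // int(2)
var_dump(count($found)); // int(2)
```

With the default `PREG_PATTERN_ORDER` flag, the same matches would instead be grouped under `$matches[0]`.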

Reference

How do i match with regex special chars that are not alphanumeric whilst ignoring emojis?

Emma
  • That regex misses quite a few emoji, as well as being dirt slow: almost 30 seconds for the Unicode emoji test text. – Jul 24 '19 at 00:01

For emoji, you should use a UTF-16 surrogate-pair regex.
The UTF-8/32 regex is way too slow.

See this link for a Unicode Version 12 emoji regex and test.
It takes 3.4 seconds, so if it times out (the default is 2 s), just raise the timeout in the settings.

By comparison, the UTF-8/32 regex takes almost 40 seconds (and requires the //u flag).

So, definitely stick with surrogate pairs for emoji regex.

https://regex101.com/r/k61Df5/1
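
For readers unfamiliar with the term, a surrogate pair is the two 16-bit code units that UTF-16 uses to encode a code point above U+FFFF. The sketch below only illustrates that arithmetic; `toSurrogatePair` is a hypothetical helper written for this example, not part of PHP or of the linked regex.

```php
<?php
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(int $codePoint): array
{
    $offset = $codePoint - 0x10000;       // 20-bit value
    $high = 0xD800 + ($offset >> 10);     // top 10 bits
    $low  = 0xDC00 + ($offset & 0x3FF);   // bottom 10 bits
    return [$high, $low];
}

// U+1F600 (grinning face) as an example input.
[$high, $low] = toSurrogatePair(0x1F600);

printf("\\u%04X\\u%04X\n", $high, $low); // prints \uD83D\uDE00
```

A surrogate-pair regex matches on these two code units, whereas the `\x{1f600}` form used with the u flag matches on the single code point.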