I want to match any full Unicode character. I'm probably using the wrong terms, but I don't necessarily mean letters; I want any displayed character with any modifiers included. Edit: I'm keeping my original wording, but upon review of this answer, perhaps grapheme is actually what I'm looking for.

The trivial regex `.` with the Unicode `u` modifier, `/./u`, does not fully suffice. A few examples:

  • ❤️ will instead match ❤ without the variation selector U+FE0F.
  • An emoji with a skin tone modifier will only match the base emoji, without the pale skin tone modifier U+1F3FB.
  • à (U+0061 (a) followed by U+0300 (grave accent)) will only match the a.
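
To make the last example concrete, here is a minimal PHP sketch (the variable names are only illustrative). It shows that the string holds two code points and that `/./u` consumes only the first one:

```php
<?php
// "a" followed by U+0300 COMBINING GRAVE ACCENT: displayed as one character.
$text = "a\u{0300}";

// With the u modifier, an empty pattern splits the string between code points.
$codePoints = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

// With the u modifier, "." consumes exactly one code point.
preg_match('/./u', $text, $match);

var_dump(count($codePoints)); // number of code points in the string
var_dump($match[0] === 'a');  // true when only the base letter was matched
```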

Following this answer, I was able to expand the pattern to this: `/.[\x{1f3fb}-\x{1f3ff}\p{M}]?/u`. This matches all of my test characters above, as well as the three Han unification characters I pulled from this web page.

Edit: I just realized this still doesn't fully match, because (at least in PHP) it fails to fully match ‍♂ (might not display properly on all devices), because it doesn't capture the male character U+2642.
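
A minimal sketch of that failure follows. The base emoji may not have survived in the text above, so the sequence used here (U+1F937 person shrugging, then U+200D zero width joiner, then U+2642 male sign) is an assumption chosen for illustration:

```php
<?php
// Assumed test sequence: person shrugging + zero width joiner + male sign.
$text = "\u{1F937}\u{200D}\u{2642}";

// The expanded pattern: one code point, optionally followed by one
// skin tone modifier or one combining mark.
$pattern = '/.[\x{1f3fb}-\x{1f3ff}\p{M}]?/u';
preg_match($pattern, $text, $match);

// U+200D is a format character (category Cf), so it is neither a skin tone
// modifier nor a mark; the optional class matches nothing and the match
// ends after the base emoji.
var_dump($match[0] === "\u{1F937}"); // only the base emoji was matched
var_dump($match[0] === $text);       // the full sequence was not matched
```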

At this point, it seems like a guessing game to me. I have a feeling there are a lot of edge cases my current regex will not cover, but I don't know enough about foreign alphabets nor am I ready to just start guessing and enumerating random emojis and symbols from the character map to fully test this.

Is there a simpler solution to actually match any character including its modifiers/combining marks/etc?

Edit: Per Rob's comment below, I'm using PHP 7.4 for the regex.

404 Not Found
  • If you use PHP, use `/\X/u` – Wiktor Stribiżew Dec 01 '22 at 22:10
  • The term you mean is "extended grapheme cluster." Yes, it's a bit of a mouthful. Every regex engine is a bit different. You'll need to identify the specific language or regex engine you mean. There's no one answer here. In a language like Swift, for example, you get this for free (you don't even need a regular expression at all), while in other languages you'll need something like libicu, so it matters which language you're working in. – Rob Napier Dec 01 '22 at 22:12
  • @WiktorStribiżew `/\X/u` is actually much better, I feel dumb for not having found it, but in my unit tests, if I have multiple emojis next to each other like ‍♂‍♂️‍♀ it captures all of them as one match. – 404 Not Found Dec 01 '22 at 22:15
  • Right, emojis can contain multiple graphemes. In PCRE, the length of the regex pattern is limited, so there is no good way to match emojis with regex. – Wiktor Stribiżew Dec 01 '22 at 22:17
  • Can you give a short PHP example of your problem with `/\X/u`? The docs definitely suggest it should work. – Rob Napier Dec 01 '22 at 22:21
  • I'm not sure how well it displays in the browser, but this is what I'm testing now: `preg_match('/\X/u', '‍♂‍♀', $matches); var_dump($matches);` The intended test string is "Person Bowing: Medium-Dark Skin Tone", "Man Gesturing No", and "Woman Gesturing OK: Medium Skin Tone". What I'd expect is to be able to match each individually (3 graphemes), but `/\X/u` instead matches that full string as one match. – 404 Not Found Dec 01 '22 at 22:28
  • I can't generalize right now, but `'/\p{So}\p{Sk}*(?:\p{Cf}+\p{So})*/u'` works. See [PHP demo](https://3v4l.org/MJf9R). – Wiktor Stribiżew Dec 02 '22 at 09:23
  • Thanks @WiktorStribiżew, I'm still not sure if this will cover everything, but a slight modification of that into this, `/.[\p{M}\p{Sk}]*(?:\p{Cf}+\p{So})*/u` appears to be succeeding for all of my current test cases. I'll continue to experiment to see if there are any apparent issues with it. – 404 Not Found Dec 02 '22 at 15:52
  • Slight tweak: `/.[\p{Sk}]*(?:\p{Cf}+\p{So})*\p{M}*/u` – 404 Not Found Dec 02 '22 at 16:26
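
The two approaches from the comments can be compared with a small sketch like the one below. The helper name `split_clusters` is hypothetical, and the input is limited to simple cases (a combining mark and a variation selector). How `\X` treats ZWJ emoji sequences depends on the Unicode version of the PCRE library PHP is built against, so that case is deliberately not asserted here:

```php
<?php
// Hypothetical helper: return every match of $pattern in $text.
function split_clusters(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

// "a" + combining grave accent, heart + variation selector U+FE0F, plain "b".
$text = "a\u{0300}" . "\u{2764}\u{FE0F}" . "b";

// \X matches one extended grapheme cluster as defined by the PCRE library.
$viaX = split_clusters('/\X/u', $text);

// The hand-built pattern from the last comment above.
$viaCustom = split_clusters('/.[\p{Sk}]*(?:\p{Cf}+\p{So})*\p{M}*/u', $text);

var_dump(count($viaX));         // clusters found by \X
var_dump(count($viaCustom));    // clusters found by the custom pattern
var_dump($viaX === $viaCustom); // whether both agree on this input
```

Running the same comparison over a larger list of test strings is one way to spot where the hand-built pattern and `\X` diverge.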

0 Answers