I want to match any full Unicode character. I'm probably using the wrong terms, but I don't necessarily mean letters; I want any displayed character with any modifiers included. Edit: I'm keeping my original wording, but upon review of this answer, perhaps grapheme is actually what I'm looking for.
Using the trivial regex .
, with the Unicode u
modifier, /./u
does not fully suffice. A few examples:
- ❤️ will instead match ❤ without the variation selector U+FE0F.
- will only match without the pale skin tone U+1F3Fb.
- à (U+0061 (a) followed by U+0300 (grave accent)) will only match the
a
.
Following this answer, I was able to expand the pattern to this: /.[\x{1f3fb}-\x{1f3ff}\p{M}]?/u
. This matches all of my test characters above, as well as the three han unification characters I pulled from this web page.
Edit: I just realized this still doesn't fully match, because (at least in PHP) it fails to fully match ♂ (might not display properly on all devices), because it doesn't capture the male character U+2642.
At this point, it seems like a guessing game to me. I have a feeling there are a lot of edge cases my current regex will not cover, but I don't know enough about foreign alphabets nor am I ready to just start guessing and enumerating random emojis and symbols from the character map to fully test this.
Is there a simpler solution to actually match any character including its modifiers/combining marks/etc?
Edit: Per Rob's comment below, I'm using PHP 7.4 for the regex.