Graphemes are the user-perceived characters of a text, which in Unicode may comprise several code points.

From Unicode® Standard Annex #29:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

Is there a regex I can use (in JavaScript) which will match a single grapheme cluster? e.g.

"한bar".match(/*?*/)[0] === "한"
"நிbaz".match(/*?*/)[0] === "நி"
"aa".match(/*?*/)[0] === "a"
"\r\n".match(/*?*/)[0] === "\r\n"
"‍♂️foo".match(/*?*/)[0] === "‍♂️"
brainkim
  • You can use Unicode escape sequences to match single UTF-16 or double UTF-16 surrogate pairs, but matching a complete grapheme cluster would be really complicated I think. There's certainly no built-in way to do it; JavaScript regular expressions are woefully inadequate for general Unicode patterns like that. – Pointy Nov 07 '18 at 21:56
  • Please take a look at the marked question and if it didn't answer your problem edit accordingly. – revo Nov 07 '18 at 22:40
  • @revo I don't think this question is a duplicate of the marked question. Firstly, it's incredibly unclear in what it asks (it starts by asking how to replace emojis in a string with their emoticon equivalents, and shifts to asking how to replace emojis with their unicode escape sequence equivalents). Secondly, the question specifically asks about emojis and neither the question nor any of the answers address other types of grapheme clusters besides emojis (most emojis are represented by a single codepoint). How do I appeal this decision? – brainkim Nov 08 '18 at 00:26
  • Perl style regular expressions use `\X` to match one cluster, but unfortunately that doesn't seem to have been lifted for javascript flavor REs... If they support matching codepoints with given unicode properties you might be able to convert the EGC grammar in the Unicode spec to a RE, though. – Shawn Nov 08 '18 at 06:04
  • `\X` matches all characters (regardless of number of bytes e.g. `a`) along with grapheme clusters as one match. It works almost the same way as `\PM\pM*` which is supported by ES6 and could be transpiled to ES5 (for this you could use [this tool](https://mothereff.in/regexpu)). But there is a difference between those two that `\X` has some rules for Hangul syllables (which you used in your examples) that is it doesn't break on them. So you have to match them separately. See [this](https://stackoverflow.com/questions/41309402/) for more insights on Hangul. – revo Nov 08 '18 at 11:36
  • And you may not need `\PM\pM*` but `\PM\pM+`. Totally depends on your requirements. – revo Nov 08 '18 at 11:39
  • I should have referenced to https://www.unicode.org/faq/korean.html earlier. – revo Nov 08 '18 at 12:13
  • @Shawn @revo What I'm looking for is exactly the `\X` escape. Updated the question with a case to reflect this. I want to match not only extended grapheme clusters but also regular characters which are represented by a single code point. Not sure what `\P{M}\p{m}*` does (you have to add brackets in javascript unicode property escape regexes https://github.com/tc39/proposal-regexp-unicode-property-escapes#why-not-support-eg-pl-as-a-shorthand-for-pl ). – brainkim Nov 08 '18 at 14:01

1 Answer

Full, easy-to-use integrated support: no. Approximations for various matching tasks: yes. From regex tutorial:

Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.

In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.

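\X is the closest, and does not exist in any version of JavaScript through ES6. \P{M}\p{M}*+ approximates \X, but does not exist in that form, because JavaScript regular expressions have no possessive quantifiers or atomic groups. If you have Unicode property escapes (native since ES2018, or via transpilation), you can use /(\P{Mark})(\p{Mark}*)/gu.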
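
A minimal sketch of that approximation, assuming an engine that supports the u flag and Unicode property escapes. The helper name firstApproxGrapheme is hypothetical, and the quantifier is * rather than + so that a bare character with no combining marks still matches:

```javascript
// One non-mark code point followed by zero or more combining marks.
// This only approximates \X; it is not a full grapheme cluster matcher.
const approxGrapheme = /\P{Mark}\p{Mark}*/u;

// Hypothetical helper: returns the first approximate grapheme, or null.
function firstApproxGrapheme(text) {
  const match = text.match(approxGrapheme);
  return match ? match[0] : null;
}

// "a" followed by U+0300 COMBINING GRAVE ACCENT is matched as one unit.
console.log(firstApproxGrapheme("a\u0300bc") === "a\u0300"); // true

// Tamil NA (U+0BA8) plus vowel sign I (U+0BBF, a spacing mark).
console.log(firstApproxGrapheme("\u0BA8\u0BBFbaz") === "\u0BA8\u0BBF"); // true

// A plain character with no marks matches on its own.
console.log(firstApproxGrapheme("abc") === "a"); // true
```

Without the u flag the property escapes are not recognized, so the flag is part of the pattern rather than an optional extra.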

But even still, that isn't sufficient. <== Read that link for all the gory details.
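
To make the shortfall concrete, here is a small sketch of inputs where a mark-based pattern splits what UAX #29 treats as one cluster. The helper name firstMarkBasedMatch and the sample strings (decomposed Hangul jamo and a zero-width-joiner emoji sequence) are chosen here for illustration and are not taken from the linked material:

```javascript
// Mark-based approximation of \X (requires the u flag and property escapes).
const markBased = /\P{Mark}\p{Mark}*/u;

// Hypothetical helper: first match of the approximation in a non-empty string.
function firstMarkBasedMatch(text) {
  return text.match(markBased)[0];
}

// CRLF is a single grapheme cluster, but "\n" is a control character,
// not a mark, so only "\r" is matched.
console.log(firstMarkBasedMatch("\r\n") === "\r"); // true

// Decomposed Hangul: conjoining jamo are letters, not marks, so only the
// leading consonant U+1112 is matched instead of the whole syllable.
console.log(firstMarkBasedMatch("\u1112\u1161\u11ABbar") === "\u1112"); // true

// Emoji joined with U+200D ZERO WIDTH JOINER: the joiner is a format
// character, not a mark, so the match stops after the first emoji.
console.log(firstMarkBasedMatch("\u{1F937}\u200D\u2642\uFE0Ffoo") === "\u{1F937}"); // true
```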

A proposal to segment text has been put forward, but it's not yet adopted. If you're dedicated to Chrome, you can use its non-standard Intl.v8BreakIterator to break clusters and match manually.
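
The segmentation proposal appears to be what later shipped as Intl.Segmenter, which is now available in current browsers and Node.js releases. The following is a sketch under that assumption, not part of the original answer; the helper name firstGrapheme is hypothetical, and the code will throw where Intl.Segmenter is missing:

```javascript
// Assumes Intl.Segmenter exists in the runtime; older engines lack it.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: returns the first grapheme cluster of `text`,
// or null for an empty string.
function firstGrapheme(text) {
  const first = segmenter.segment(text)[Symbol.iterator]().next();
  return first.done ? null : first.value.segment;
}

// CRLF is kept together as one cluster.
console.log(firstGrapheme("\r\n") === "\r\n"); // true

// Decomposed Hangul jamo form a single syllable cluster.
console.log(firstGrapheme("\u1112\u1161\u11ABbar") === "\u1112\u1161\u11AB"); // true
```

This is not a regular expression, so it answers the underlying need (finding cluster boundaries) rather than the literal question; each segment also carries an index property that can be combined with ordinary string operations.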

bishop