8

Unicode text segmentation requires access to the Grapheme_Cluster_Break property of characters, which JavaScript famously doesn't provide in a direct way. I was hoping to use Unicode property escapes in a regexp to work around this, but it doesn't seem to be as simple as /\p{Grapheme_Cluster_Break=Extend}/u or something like that. You can do \p{Grapheme_Extend}, but that tests for something different.

Is there a way to trick JavaScript runtimes into giving me information about characters' Grapheme_Cluster_Break value through property escapes? (And if not, why not?)
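
To illustrate the "something different" mentioned above, here is a minimal sketch. It assumes the UAX #29 definition of the Extend class (Grapheme_Extend=Yes or Emoji_Modifier=Yes); the `describe` helper is a hypothetical name used only for this illustration.

```javascript
// Both of these are binary properties that ECMAScript property escapes accept.
const graphemeExtend = /^\p{Grapheme_Extend}$/u;
const emojiModifier = /^\p{Emoji_Modifier}$/u;

// Hypothetical helper: reports which of the two binary properties a single
// code point has.
function describe(char) {
  return {
    graphemeExtend: graphemeExtend.test(char),
    emojiModifier: emojiModifier.test(char),
  };
}

// U+0301 COMBINING ACUTE ACCENT is a nonspacing mark.
const acute = describe('\u0301');

// U+1F3FB is an emoji skin-tone modifier: UAX #29 puts it in the Extend
// class, yet it does not have the Grapheme_Extend binary property.
const skinTone = describe('\u{1F3FB}');

console.log(acute, skinTone);
```

So \p{Grapheme_Extend} alone misses characters that segmentation treats as Extend, which is why it is not a drop-in substitute.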

Marijn
  • Does this answer your question? [ES2015 Unicode regular expression transpiler](https://mothereff.in/regexpu#input=var+regex+%3D+/%5Cp%7BGrapheme_Extend%7D/u%3B&unicodePropertyEscape=1) I'm working on an equivalent ES5 regex shorter than 3677 chars… – JosefZ Jun 10 '20 at 19:30
  • No, that doesn't answer the question. – Marijn Jun 11 '20 at 13:54
  • Are these the GCBs needed: `\p{Grapheme_Cluster_Break=Control} \p{Grapheme_Cluster_Break=CR} \p{Grapheme_Cluster_Break=Extend} \p{Grapheme_Cluster_Break=L} \p{Grapheme_Cluster_Break=LF} \p{Grapheme_Cluster_Break=LV} \p{Grapheme_Cluster_Break=LVT} \p{Grapheme_Cluster_Break=Prepend} \p{Grapheme_Cluster_Break=Regional_Indicator} \p{Grapheme_Cluster_Break=SpacingMark} \p{Grapheme_Cluster_Break=T} \p{Grapheme_Cluster_Break=V} \p{Grapheme_Cluster_Break=ZWJ}`? If so, do you need them as a whole `[all props]`, or do you need to know which one matched `(\p{1})|(\p{2})|...(\p{13})`? –  Jun 12 '20 at 18:37
  • Above I describe 2 regexes. I can give you both using the latest UCD info from Unicode 13. Combined, there are 13 properties of interest, which collectively match 17,839 char units. The latter is of course broken down into individual properties for convenience. –  Jun 12 '20 at 18:41
  • Of course there are other auxiliary GCB values not included there, for example `\p{Grapheme_Cluster_Break=Other} \p{Grapheme_Cluster_Break=E_Base} \p{Grapheme_Cluster_Break=E_Base_GAZ} \p{Grapheme_Cluster_Break=E_Modifier} \p{Grapheme_Cluster_Break=Glue_After_Zwj}` –  Jun 12 '20 at 18:46
  • Those `\p` expressions aren't valid (or at least, are not accepted by any current engine I tried). – Marijn Jun 13 '20 at 17:49
  • You might want to check which regex properties the latest version of ECMA-262 supports, if any, then check whether `\p{Grapheme_Cluster_Break=Extend}` is one of them. Sometimes they change the names but mean the same thing, so check that. If that doesn't work you have no choice: use the regex I provided and you have the latest. –  Jun 14 '20 at 19:20

2 Answers

2

Re: Grapheme_Cluster_Break=Extend

Excluding unassigned code points and noncharacters, the UCD lists 1725 characters with this property value in Unicode 13.

They can be matched using the UTF-16 regex ranges below.

This regex uses literal Unicode characters and is about 1977 characters long.
Only the factored surrogate halves are written in \u notation.

(?:[̀-ͯ҃-҉֑-ׇֽֿׁׂׅׄؐ-ًؚ-ٰٟۖ-ۜ۟-۪ۤۧۨ-ܑۭܰ-݊ަ-ް߫-߽߳ࠖ-࠙ࠛ-ࠣࠥ-ࠧࠩ-࡙࠭-࡛࣓-ࣣ࣡-ंऺ़ु-ै्॑-ॗॢॣঁ়াু-ৄ্ৗৢৣ৾ਁਂ਼ੁੂੇੈੋ-੍ੑੰੱੵઁં઼ુ-ૅેૈ્ૢૣૺ-૿ଁ଼ାିୁ-ୄ୍୕-ୗୢୣஂாீ்ௗఀఄా-ీె-ైొ-్ౕౖౢౣಁ಼ಿೂೆೌ್ೕೖೢೣഀഁ഻഼ാു-ൄ്ൗൢൣඁ්ාි-ුූෟัิ-ฺ็-๎ັິ-ຼ່-ໍཱ༹༘༙༵༷-ཾྀ-྄྆྇ྍ-ྗྙ-ྼ࿆ိ-ူဲ-့္်ွှၘၙၞ-ၠၱ-ၴႂႅႆႍႝ፝-፟ᜒ-᜔ᜲ-᜴ᝒᝓᝲᝳ឴឵ិ-ួំ៉-៓៝ᢅᢆᢩᤠ-ᤢᤧᤨᤲ᤹-᤻ᨘᨗᨛᩖᩘ-ᩞ᩠ᩢᩥ-ᩬᩳ-᩿᩼᪰-ᫀᬀ-ᬃ᬴-ᬺᬼᭂ᭫-᭳ᮀᮁᮢ-ᮥᮨᮩ᮫-ᮭ᯦ᯨᯩᯭᯯ-ᯱᰬ-ᰳᰶ᰷᳐-᳔᳒-᳢᳠-᳨᳭᳴᳸᳹᷀-᷹᷻-᷿‌⃐-⃰⳯-⵿⳱ⷠ-〪ⷿ-゙゚〯꙯-꙲ꙴ-꙽ꚞꚟ꛰꛱ꠂ꠆ꠋꠥꠦ꠬꣄ꣅ꣠-꣱ꣿꤦ-꤭ꥇ-ꥑꦀ-ꦂ꦳ꦶ-ꦹꦼꦽꧥꨩ-ꨮꨱꨲꨵꨶꩃꩌꩼꪰꪲ-ꪴꪷꪸꪾ꪿꫁ꫬꫭ꫶ꯥꯨ꯭]|\ud800[\uddfd\udee0\udf76-\udf7a]|\ud802[\ude01-\ude03\ude05\ude06\ude0c-\ude0f\ude38-\ude3a\ude3f\udee5\udee6]|\ud803[\udd24-\udd27\udeab\udeac\udf46-\udf50]|\ud804[\udc01\udc38-\udc46\udc7f-\udc81\udcb3-\udcb6\udcb9\udcba\udd00-\udd02\udd27-\udd2b\udd2d-\udd34\udd73\udd80\udd81\uddb6-\uddbe\uddc9-\uddcc\uddcf\ude2f-\ude31\ude34\ude36\ude37\ude3e\udedf\udee3-\udeea\udf00\udf01\udf3b\udf3c\udf3e\udf40\udf57\udf66-\udf6c\udf70-\udf74]|\ud805[\udc38-\udc3f\udc42-\udc44\udc46\udc5e\udcb0\udcb3-\udcb8\udcba\udcbd\udcbf\udcc0\udcc2\udcc3\uddaf\uddb2-\uddb5\uddbc\uddbd\uddbf\uddc0\udddc\udddd\ude33-\ude3a\ude3d\ude3f\ude40\udeab\udead\udeb0-\udeb5\udeb7\udf1d-\udf1f\udf22-\udf25\udf27-\udf2b]|\ud806[\udc2f-\udc37\udc39\udc3a\udd30\udd3b\udd3c\udd3e\udd43\uddd4-\uddd7\uddda\udddb\udde0\ude01-\ude0a\ude33-\ude38\ude3b-\ude3e\ude47\ude51-\ude56\ude59-\ude5b\ude8a-\ude96\ude98\ude99]|\ud807[\udc30-\udc36\udc38-\udc3d\udc3f\udc92-\udca7\udcaa-\udcb0\udcb2\udcb3\udcb5\udcb6\udd31-\udd36\udd3a\udd3c\udd3d\udd3f-\udd45\udd47\udd90\udd91\udd95\udd97\udef3\udef4]|\ud81a[\udef0-\udef4\udf30-\udf36]|\ud81b[\udf4f\udf8f-\udf92\udfe4]|\ud82f[\udc9d\udc9e]|\ud834[\udd65\udd67-\udd69\udd6e-\udd72\udd7b-\udd82\udd85-\udd8b\uddaa-\uddad\ude42-\ude44]|\ud836[\ude00-\ude36\ude3b-\ude6c\ude75\ude84\ude9b-\ude9f\udea1-\udeaf]|\ud838[\udc00-\udc06\udc08-\udc18\udc1b-\udc21\udc23\udc24\udc26-\udc2a\udd30-\udd36\udeec-\udeef]|\ud83a[\udcd0-\udcd6\udd44-\udd4a]|\ud83c[\udffb-\udfff]|\udb40[\udc20-\udc7f]|[ﬞ︠-゙゚︯])

demo

This regex is the same, but all Unicode characters are converted to \u notation; it ends up being 3670 characters long.

(?:[\u0300-\u036f\u0483-\u0489\u0591-\u05bd\u05bf\u05c1\u05c2\u05c4\u05c5\u05c7\u0610-\u061a\u064b-\u065f\u0670\u06d6-\u06dc\u06df-\u06e4\u06e7\u06e8\u06ea-\u06ed\u0711\u0730-\u074a\u07a6-\u07b0\u07eb-\u07f3\u07fd\u0816-\u0819\u081b-\u0823\u0825-\u0827\u0829-\u082d\u0859-\u085b\u08d3-\u08e1\u08e3-\u0902\u093a\u093c\u0941-\u0948\u094d\u0951-\u0957\u0962\u0963\u0981\u09bc\u09be\u09c1-\u09c4\u09cd\u09d7\u09e2\u09e3\u09fe\u0a01\u0a02\u0a3c\u0a41\u0a42\u0a47\u0a48\u0a4b-\u0a4d\u0a51\u0a70\u0a71\u0a75\u0a81\u0a82\u0abc\u0ac1-\u0ac5\u0ac7\u0ac8\u0acd\u0ae2\u0ae3\u0afa-\u0aff\u0b01\u0b3c\u0b3e\u0b3f\u0b41-\u0b44\u0b4d\u0b55-\u0b57\u0b62\u0b63\u0b82\u0bbe\u0bc0\u0bcd\u0bd7\u0c00\u0c04\u0c3e-\u0c40\u0c46-\u0c48\u0c4a-\u0c4d\u0c55\u0c56\u0c62\u0c63\u0c81\u0cbc\u0cbf\u0cc2\u0cc6\u0ccc\u0ccd\u0cd5\u0cd6\u0ce2\u0ce3\u0d00\u0d01\u0d3b\u0d3c\u0d3e\u0d41-\u0d44\u0d4d\u0d57\u0d62\u0d63\u0d81\u0dca\u0dcf\u0dd2-\u0dd4\u0dd6\u0ddf\u0e31\u0e34-\u0e3a\u0e47-\u0e4e\u0eb1\u0eb4-\u0ebc\u0ec8-\u0ecd\u0f18\u0f19\u0f35\u0f37\u0f39\u0f71-\u0f7e\u0f80-\u0f84\u0f86\u0f87\u0f8d-\u0f97\u0f99-\u0fbc\u0fc6\u102d-\u1030\u1032-\u1037\u1039\u103a\u103d\u103e\u1058\u1059\u105e-\u1060\u1071-\u1074\u1082\u1085\u1086\u108d\u109d\u135d-\u135f\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17b4\u17b5\u17b7-\u17bd\u17c6\u17c9-\u17d3\u17dd\u1885\u1886\u18a9\u1920-\u1922\u1927\u1928\u1932\u1939-\u193b\u1a17\u1a18\u1a1b\u1a56\u1a58-\u1a5e\u1a60\u1a62\u1a65-\u1a6c\u1a73-\u1a7c\u1a7f\u1ab0-\u1ac0\u1b00-\u1b03\u1b34-\u1b3a\u1b3c\u1b42\u1b6b-\u1b73\u1b80\u1b81\u1ba2-\u1ba5\u1ba8\u1ba9\u1bab-\u1bad\u1be6\u1be8\u1be9\u1bed\u1bef-\u1bf1\u1c2c-\u1c33\u1c36\u1c37\u1cd0-\u1cd2\u1cd4-\u1ce0\u1ce2-\u1ce8\u1ced\u1cf4\u1cf8\u1cf9\u1dc0-\u1df9\u1dfb-\u1dff\u200c\u20d0-\u20f0\u2cef-\u2cf1\u2d7f\u2de0-\u2dff\u302a-\u302f\u3099\u309a\ua66f-\ua672\ua674-\ua67d\ua69e\ua69f\ua6f0\ua6f1\ua802\ua806\ua80b\ua825\ua826\ua82c\ua8c4\ua8c5\ua8e0-\ua8f1\ua8ff\ua926-\ua92d\ua947-\ua951\ua980-\ua982\ua9b3\ua9b6-\ua9b9\ua9bc\ua9bd\ua9e5\uaa29-\uaa2e\uaa31\uaa32\uaa35\uaa36\uaa43\uaa4c\uaa7c\uaab0\uaab2-\uaab4\uaab7\uaab8\uaabe\uaabf\uaac1\uaaec\uaaed\uaaf6\uabe5\uabe8\uabed]|\ud800[\uddfd\udee0\udf76-\udf7a]|\ud802[\ude01-\ude03\ude05\ude06\ude0c-\ude0f\ude38-\ude3a\ude3f\udee5\udee6]|\ud803[\udd24-\udd27\udeab\udeac\udf46-\udf50]|\ud804[\udc01\udc38-\udc46\udc7f-\udc81\udcb3-\udcb6\udcb9\udcba\udd00-\udd02\udd27-\udd2b\udd2d-\udd34\udd73\udd80\udd81\uddb6-\uddbe\uddc9-\uddcc\uddcf\ude2f-\ude31\ude34\ude36\ude37\ude3e\udedf\udee3-\udeea\udf00\udf01\udf3b\udf3c\udf3e\udf40\udf57\udf66-\udf6c\udf70-\udf74]|\ud805[\udc38-\udc3f\udc42-\udc44\udc46\udc5e\udcb0\udcb3-\udcb8\udcba\udcbd\udcbf\udcc0\udcc2\udcc3\uddaf\uddb2-\uddb5\uddbc\uddbd\uddbf\uddc0\udddc\udddd\ude33-\ude3a\ude3d\ude3f\ude40\udeab\udead\udeb0-\udeb5\udeb7\udf1d-\udf1f\udf22-\udf25\udf27-\udf2b]|\ud806[\udc2f-\udc37\udc39\udc3a\udd30\udd3b\udd3c\udd3e\udd43\uddd4-\uddd7\uddda\udddb\udde0\ude01-\ude0a\ude33-\ude38\ude3b-\ude3e\ude47\ude51-\ude56\ude59-\ude5b\ude8a-\ude96\ude98\ude99]|\ud807[\udc30-\udc36\udc38-\udc3d\udc3f\udc92-\udca7\udcaa-\udcb0\udcb2\udcb3\udcb5\udcb6\udd31-\udd36\udd3a\udd3c\udd3d\udd3f-\udd45\udd47\udd90\udd91\udd95\udd97\udef3\udef4]|\ud81a[\udef0-\udef4\udf30-\udf36]|\ud81b[\udf4f\udf8f-\udf92\udfe4]|\ud82f[\udc9d\udc9e]|\ud834[\udd65\udd67-\udd69\udd6e-\udd72\udd7b-\udd82\udd85-\udd8b\uddaa-\uddad\ude42-\ude44]|\ud836[\ude00-\ude36\ude3b-\ude6c\ude75\ude84\ude9b-\ude9f\udea1-\udeaf]|\ud838[\udc00-\udc06\udc08-\udc18\udc1b-\udc21\udc23\udc24\udc26-\udc2a\udd30-\udd36\udeec-\udeef]|\ud83a[\udcd0-\udcd6\udd44-\udd4a]|\ud83c[\udffb-\udfff]|\udb40[\udc20-\udc7f]|[\ufb1e\ufe20-\ufe2f\uff9e\uff9f])

demo
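
For readers unfamiliar with the surrogate alternations in the regexes above, here is a minimal sketch of how one of them maps back to code points. The `toSurrogatePair` helper and the two-branch `excerpt` pattern are illustrative only; the excerpt copies just two pieces of the full regex (`\u0300-\u036f` and `\ud83c[\udffb-\udfff]`) and is not a replacement for it.

```javascript
// Hypothetical helper: split a supplementary-plane code point (>= 0x10000)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// U+1F3FB (emoji skin-tone modifier) is covered by \ud83c[\udffb-\udfff].
const [high, low] = toSurrogatePair(0x1f3fb);
console.log(high.toString(16), low.toString(16)); // d83c dffb

// A tiny excerpt of the full pattern: one BMP range and one surrogate branch.
// No `u` flag is used, so the regex matches individual UTF-16 code units.
const excerpt = /(?:[\u0300-\u036f]|\ud83c[\udffb-\udfff])/;

const matchesAcute = excerpt.test('\u0301');        // BMP combining mark
const matchesSkinTone = excerpt.test('\u{1F3FB}');  // surrogate pair
const matchesLetter = excerpt.test('a');            // plain ASCII letter

console.log(matchesAcute, matchesSkinTone, matchesLetter);
```

Because the pattern works on code units, it also runs in engines that predate the `u` flag, which is the point of factoring the supplementary-plane characters by high surrogate.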

  • I understand you can just enumerate all ranges into a regexp, but the question was whether it is possible to directly query for that property using a property escape. – Marijn Jun 13 '20 at 17:47
  • I don't know whether the latest ECMA-#### regex specs support `\p{}` properties or not. They never did in the past. It's best to read/search the latest spec to see if, and which, properties it supports. Have you checked? What's in my answer is the set of values obtained by querying the UCD using the ICU property `\p{Grapheme_Cluster_Break=Extend}`. The 1725 chars returned represent all characters in Unicode 13 having that property (as well as other properties). The result is factored into a regex and is what is offered in the answer. It's not an enumeration at all but a regex that is 100% equivalent. –  Jun 14 '20 at 19:08
  • Also, most engines `enumerate` an unordered list (not ICU); Java is a good example. There, a property like `\p{Grapheme_Cluster_Break=Extend}` is a state that calls a function to check if the target character is in the list. Unfortunately, engines/languages don't update those _lists_ with every release of Unicode (at 13 now). That is the dilemma. Sorry, those are the facts. I could easily have produced a UTF-32/UTF-8 regex as well, but since JS is a UTF-16 thing, that is what I provided. –  Jun 14 '20 at 19:12
0

According to the ECMAScript specification, the answer is sadly no.

If you look at the section for UnicodeMatchProperty: https://tc39.es/ecma262/#sec-runtime-semantics-unicodematchproperty-p

You can see that while a whole host of binary Unicode properties are supported (including e.g. Grapheme_Extend), the only non-binary properties (those that take a value) that are supported are General_Category, Script, and Script_Extensions.

Grapheme_Cluster_Break is not included, unfortunately.
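
You can check this from code with a small sketch like the one below. The `supportsEscape` helper is a hypothetical name; it builds the regex at runtime so that an unsupported property escape surfaces as a catchable SyntaxError instead of a parse error for the whole script.

```javascript
// Hypothetical helper: returns true if the engine accepts the given
// property escape in a `u`-flag regex, false if it rejects it.
function supportsEscape(escape) {
  try {
    new RegExp(escape, 'u');
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

const binaryOk = supportsEscape('\\p{Grapheme_Extend}');          // binary property
const scriptOk = supportsEscape('\\p{Script=Greek}');             // supported non-binary
const gcbOk = supportsEscape('\\p{Grapheme_Cluster_Break=Extend}'); // not in the spec

console.log(binaryOk, scriptOk, gcbOk);
```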

I don't know why the specification would specifically include some Unicode properties but not others. But there doesn't appear to be any trick to accessing it.

While it's not a RegExp, you can segment graphemes using Intl.Segmenter, though it's currently only built into Chrome and Safari -- Firefox still hasn't implemented it.
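
A minimal sketch of that approach follows. The `splitGraphemes` name is hypothetical, and the function returns null where Intl.Segmenter is missing so the caller can fall back to something else.

```javascript
// Hypothetical helper: split text into grapheme clusters using
// Intl.Segmenter, or return null if the runtime doesn't provide it.
function splitGraphemes(text, locale = 'en') {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null;
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } objects.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// "e" + U+0301 COMBINING ACUTE ACCENT forms one cluster; "a" is another.
const clusters = splitGraphemes('e\u0301a');
console.log(clusters);
```

This gives you the cluster boundaries, but it still does not expose the Grapheme_Cluster_Break value of an individual character.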

crazygringo