Does Vim have an equivalent to \X to match Unicode "grapheme clusters"?

Question

Unicode specifies that \X should match an "extended grapheme cluster" - for instance a base character followed by zero or more combining characters. (I believe this is a simplification but may suffice for my needs.)

I'm pretty sure at least Perl supports \X in its regular expressions.

But Vim defines \X to match any character that is not a hexadecimal digit.

Does Vim have any equivalent to \X or any way to match a Unicode extended grapheme cluster?

Vim does have a concept of combining or "composing" characters, but its documentation does not cover whether or how they are supported in regular expressions.

It seems that Vim does not yet support this directly, but I am still interested in a workaround where a search will highlight all characters which include a combining character in at least the most basic range of U+0300 to U+0364.

What exactly do you want to do? Could you provide a sample case? Do you want to match such "characters" as à or Æ? — romainl, Jun 07 '12 at 13:31
I'm going to write some JavaScript code to convert between Georgian language characters and various official and ad-hoc transliteration schemes. Some such characters may involve combining characters so I want to make sure my tools are capable of working with them including telling me which text I find in the wild and paste in contains such characters. — hippietrail, Jun 07 '12 at 13:38
For instance, I might need to handle `J̌` (`004a 030c`). But more generally I just want to know whether Vim has or plans to have support for this, as it's becoming more and more common that us programmers have to deal with such things. — hippietrail, Jun 07 '12 at 13:41
Your example is matched with `/\%u004a\%u030c\Z`. You'll have to come up with a seriously big pattern if you want to highlight every possible combination. The upside is that it will probably be portable to JS with "minimal" effort. Oh, and Kyle's answer is very informative. — romainl, Jun 07 '12 at 14:04
@romainl: In fact my example is also matched by just `\%u030c`, but when I try to extend the pattern from just `COMBINING CARON` to the entire `Combining Diacritical Marks` range by using `[\u0300-\u0364]` nothing is matched any longer! — hippietrail, Jun 07 '12 at 14:16

beerbajay · Answer 1 · 2012-06-07T14:40:28.380

3

If your vim installation is compiled with perl support, you may be able to run:

:perldo s/\X/replacement/g

I installed vim-nox on Debian (which includes Perl support), and matching \X with perldo does indeed work. I'm not sure it will do what you want, though: every ordinary character is also matched, and perldo doesn't seem to give you highlighting in Vim.

It's not perfect, but if you can get Perl support, you can use Unicode blocks and categories: \p{Block: Combining_Diacritical_Marks} or \p{Category: Nonspacing_Mark} will at least detect certain characters, though you still won't get highlighting.
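For anyone without Perl support, roughly the same two checks can be sketched outside Vim in Python. This is only an illustration of the category and block tests above, not something Vim does itself: `find_combining` is a hypothetical helper name, and the block boundaries U+0300..U+036F are written in by hand.

```python
import unicodedata

# Combining Diacritical Marks block, hard-coded here as U+0300..U+036F.
BLOCK_START, BLOCK_END = 0x0300, 0x036F


def find_combining(text):
    """Return (index, code point, in_block) for every nonspacing mark in text.

    "Mn" is the Unicode general category Nonspacing_Mark; in_block says
    whether the mark also lies in the Combining Diacritical Marks block.
    """
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Mn":
            in_block = BLOCK_START <= ord(ch) <= BLOCK_END
            hits.append((i, f"U+{ord(ch):04X}", in_block))
    return hits


if __name__ == "__main__":
    # J followed by COMBINING CARON, then an unmarked J.
    print(find_combining("J\u030c and plain J"))
```

Like the perldo approach, this only reports where the marks are; it does not give you highlighting in the editor.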

edited Jun 07 '12 at 14:40

answered Jun 07 '12 at 13:22

beerbajay

I also use gVim on Windows, which has no Perl support either. And in fact, for now I just want to search for and highlight such characters rather than replace them. – hippietrail Jun 07 '12 at 13:25
Thanks for going to the effort to see what's possible! – hippietrail Jun 07 '12 at 13:50
@hippietrail I have vim and gvim on “windows” (wine) and can say that it works just fine with strawberry perl. I compiled it by myself though, but previously used vim from [tuxproject](http://tuxproject.de.nyud.net/projects/vim/) and it worked fine with strawberry perl as well. You need to have the same perl version as the one listed on the project page though. – ZyX Jun 07 '12 at 19:21
Note also that it is not impossible to use Python or Perl to highlight these characters, but it takes much more code and, if you need to keep the highlighting up to date as the text changes, it is very, *very* slow. It is merely slow if you don't need updates and are fine with the highlighting disappearing (or becoming wrong) once you have edited the text. In that case you are unlikely to notice the slowness unless the text is very big. – ZyX Jun 07 '12 at 19:27

score 3 · Answer 2 · edited Jun 07 '12 at 13:57

You can search for all characters and ignore composing characters with \Z. Or you can search for a range of Unicode characters. Read :help /[] for more information on both.

The last post here may offer some more help:

http://vim.1045645.n5.nabble.com/using-regexp-to-search-for-Unicode-code-points-and-properties-td1190333.html

But Vim's regex does not have Unicode property character classes like Perl's \p{...}.
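To make the "range of Unicode characters" idea concrete outside Vim, here is a minimal Python sketch of the pattern the question is after: a character followed by one or more marks from an explicit range. It assumes the full Combining Diacritical Marks block (U+0300..U+036F) rather than the narrower range given in the question, and `marked_clusters` is a hypothetical helper name. It is a simplification, not a real extended grapheme cluster matcher.

```python
import re

# One character outside U+0300..U+036F, followed by one or more marks
# from that range. The "base" is simply any non-mark character, so this
# is a rough approximation rather than full grapheme segmentation.
CLUSTER = re.compile("[^\u0300-\u036f][\u0300-\u036f]+")


def marked_clusters(text):
    """Return (start index, matched text) for each base+mark sequence."""
    return [(m.start(), m.group()) for m in CLUSTER.finditer(text)]


if __name__ == "__main__":
    # Matches the J with COMBINING CARON but not the unmarked J.
    for start, cluster in marked_clusters("J\u030c and plain J"):
        print(start, [f"U+{ord(c):04X}" for c in cluster])
```

This matches `004a 030c` but not a lone `004a`, which is the behaviour wanted for finding such characters in pasted text.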

edited Jun 07 '12 at 13:57

Xavier T.

answered Jun 07 '12 at 13:52

embedded.kyle

While `\Z` is interesting, it doesn't do what the OP wants, which is to match, e.g. `004a 030c` but **not** just `004a`, though this isn't directly clear from the question text, but from "telling me which text ... contains such characters" in the comments. – beerbajay Jun 07 '12 at 14:12
I understand. I was hoping that someone would be able to combine `\Z` with `\[]` to come up with the answer. I had tried `\[\Z^\w]` but that did not work. @romainl was able to take it a step further but we're not totally there yet. – embedded.kyle Jun 07 '12 at 14:19