Match Unicode character with regular expression

Question

I can use regular expressions in VBA for Word 2019:

Dim RegEx As New RegExp
Dim Matches As MatchCollection

RegEx.Pattern = "[\d\w]+"
Text = "HelloWorld"
Set Matches = RegEx.Execute(Text)

But how can I match all Unicode characters and all digits too?

\p{L} works fine for me in PHP, but this doesn't work for me in VBA for Word 2019.

I would like to find words with characters and digits. So in PHP I use for this [\p{L}\p{N}]+. Which pattern can I use for this in VBA?

Currently, I would like to match words with German character, like äöüßÄÖÜ. But maybe I need this for other languages too.

Unicode characters are [not](https://stackoverflow.com/a/60192760/9758194) supported in VBA's regular expressions. What are you trying to exclude from your match? Do you have sample data? — JvdV, May 25 '20 at 06:39

Tomalak · Accepted Answer · 2020-05-25T10:57:06.140

But how can I match all Unicode characters and all digits too?

"VBScript Regular Expressions 5.5" (which I am pretty sure you are using here) are not "VBA Regular Expressions", they are a COM library that you can use in - among other things - VBA. They do not support Unicode with the built-in metacharacters (such as \w) and they have no knowledge of Unicode character classes (such as \p{L}). But of course you can still match Unicode characters with them.

Direct Matches

The simplest way is of course to directly use the Unicode characters you search for in the pattern. VBA uses Unicode strings, so matching Unicode is not a problem per se. Representing Unicode in your VBA source code, which itself is not Unicode, is a different matter. But ChrW() can help with that.

Assuming you have a certain character you want to match,

RegEx.Pattern = ChrW(&h4E16) & ChrW(&h754C) 
Set Matches = RegEx.Execute(Text)

Msgbox Matches(0)

The above uses hex numbers (&h...) and ChrW() to create the Unicode characters U+4E16 and U+754C (世界) at run-time. When they are in your text, they will be found. This is tedious, but it works well if you already know what words you're looking for.

Ranges

If you want to match character ranges, you can do that as well. Use the start point and end point of the range. For example, the basic block of the "CJK Unified Ideographs" range goes from U+4E00 to U+9FFF:

RegEx.Pattern = "[" + ChrW(&h4E00) & "-" & ChrW(&h9FFF) & "]+" 
Set Matches = RegEx.Execute(Text)

Msgbox Matches(0)

So this creates a natural range just like [a-z]+ to span all of the CJK characters. You'd have to define which ranges you want to match, so it's less convenient has having built-in support, but nothing is stopping you.

Caveats

The above is about matching Characters inside of the BMP (Basic Multilingual Plane). Characters outside of the BMP, such as Emoji, is a lot more difficult because of the way these characters work in Unicode. It's still possible, but it's not going to be pretty.

There are multiple ways of representing the same character. For example, ä could be represented by its own, singluar code-point, or by a followed by a second code-point for the dots (U+0308 "◌̈"). Since there is no telling how your input string represents certain characters, you should look into Unicode Normalization to make strings uniform before you search in them. In VBA this can be done by using the Win32 API.

Helpers

You can research Unicode ranges manually, but since there are so many of them, it's easy to miss some. I remember a useful helper for manually picking Unicode ranges, which now still lives on the Internet Archive: http://web.archive.org/web/20191118224127/http://kourge.net/projects/regexp-unicode-block

It allows you to qickly build regexes that span multiple ranges. It's aimed at JavaScript, but it's easy enough to adapt the output for VBA code.

Match Unicode character with regular expression

1 Answers1

Direct Matches

Ranges

Caveats

Helpers