MongoDB regex option 'u' (UTF-8)

Question

I built a regex and tried to use it with MongoDB, but I don't get the expected result because I don't know how to enable UTF-8 support.

For example with the regex : \b42\b
Example available here

Should match :

42
hello 42
hello-42-

Shouldn't match :

été42 
042
 4 2

The tricky case is été42: without the UTF-8 option the regex matches it, but it shouldn't.

The documentation doesn't mention the usual u option.

My current query is:

db.getCollection('collection_name').find(
{
    'title' : { $regex : '\\b42\\b', $options: 'i'}
});

I use MongoDB version 3.2 but this issue is the same with 3.4.

Possible duplicate of [difference between \w and \b regular expression meta characters](https://stackoverflow.com/questions/11874234/difference-between-w-and-b-regular-expression-meta-characters) — ctwheels, Nov 01 '17 at 17:50
It's an issue of `\b` not matching Unicode characters. The word boundary token is weird at times, especially when you try to mix it with Unicode characters (and sometimes even when you enable Unicode in the regex with the `u` modifier). — ctwheels, Nov 01 '17 at 18:03
@ctwheels [Here](https://regex101.com/r/7wWuwo/3) is my example with `\b` and the `u` option: it works fine. But the same with `\w` doesn't produce any match. I'm probably missing something, but I don't think that `\w` is the answer to my question. — Opsse, Nov 01 '17 at 18:09
Take a look at this other recent post with same issue: https://stackoverflow.com/questions/46917131/regex-unicode-and-accent?noredirect=1#comment80787344_46917131 — ctwheels, Nov 01 '17 at 18:10
You need this regex `(?:(?<=[^\p{L}\p{N}])|^)42(?=[^\p{L}\p{N}]|$)`, but I don't think it'll work in Mongo. — ctwheels, Nov 01 '17 at 18:16
@ctwheels It seems to work but I don't really understand why nothing more simple is possible. Could you please remove this wrong duplicate tag. If you want to help, add a real answer. — Opsse, Nov 01 '17 at 18:28
If you can't define the `u` modifier in `$options`, try adding the PCRE verb `(*UCP)` and try `$regex : '(*UCP)\\b42\\b'` — Wiktor Stribiżew, Nov 01 '17 at 18:46
@WiktorStribiżew I hope it's ok, I've added your regex to my solution below. If not, just let me know and I'll remove it. — ctwheels, Nov 01 '17 at 19:10
@ctwheels You actually should not do that, but I do not care. — Wiktor Stribiżew, Nov 01 '17 at 19:11
@WiktorStribiżew I can remove it if you feel that I should. I added it for completeness but if you add an answer with your solution I'll be more than willing to remove it from my answer entirely. — ctwheels, Nov 01 '17 at 19:14
@Opsse out of curiosity and for future viewers of this question, which method worked in MongoDB? Would you be able to add a comment under my answer to let other MongoDB users know which method worked, or if both methods worked? — ctwheels, Nov 02 '17 at 13:22
@ctwheels Both solutions work fine, I edited your answer with mongo requests. — Opsse, Nov 02 '17 at 15:04
@Opsse looks good, I approved the edit. Thank you for updating the answer to include usage and to confirm which one(s) worked! — ctwheels, Nov 02 '17 at 15:06

score 1 · Accepted Answer · edited Nov 02 '17 at 15:05

Brief

Word boundaries \b act oddly at times, especially when used with Unicode characters. This is due to the nature of the word character \w and how each flavour of regex interprets it. Word characters \w are usually defined as a-zA-Z0-9_. When you enable Unicode matching, some regex flavours include Unicode characters in the word character's set, whilst others do not.

Why all this talk about word characters? Because word boundaries \b depend on word characters \w. \b is an assertion that ensures (^\w|\w$|\W\w|\w\W) matches at that location.
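A minimal sketch of that dependence, using JavaScript regexes only because their \w and \b are ASCII-only (an assumption made for illustration; this is not MongoDB's own engine). Since é is not in [A-Za-z0-9_], the engine sees a word boundary between é and 4, which is exactly the unwanted match from the question. The variable names are hypothetical.

```javascript
// JavaScript's \w and \b are ASCII-only, like the default behaviour described above.
const asciiBoundary = /\b42\b/;

// '\u00e9' is a precomposed "é"; it is not in [A-Za-z0-9_], so it is a non-word character
const eIsWordChar = /\w/.test('\u00e9');

// "été42": a non-word character followed by '4' is a word boundary, so the pattern matches
const matchesAccented = asciiBoundary.test('\u00e9t\u00e942');

// "042": '0' is a word character, so there is no boundary before '4'
const matchesZeroPrefixed = asciiBoundary.test('042');

console.log({ eIsWordChar, matchesAccented, matchesZeroPrefixed });
// prints { eIsWordChar: false, matchesAccented: true, matchesZeroPrefixed: false }
```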

To cite @Ωmega's answer on this post

The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length. There are three different positions that qualify as word boundaries:

Before the first character in the string, if the first character is a word character.

After the last character in the string, if the last character is a word character.

Between two characters in the string, where one is a word character and the other is not a word character.

Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters". In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.

\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.

\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.

\W is short for [^\w], the negated version of \w.

Code

See this regex in use here

(?:(?<=[^\p{L}\p{N}])|^)42(?=[^\p{L}\p{N}]|$)

Results

Input

42
hello 42
hello-42-

été42 
042
 4 2

Output

Note: Below are the strings where a match occurred.

42
hello 42
hello-42-

Mongo

Tested and validated with this mongo filter:
{ $regex : '(?:(?<=[^\\p{L}\\p{N}])|^)42(?=[^\\p{L}\\p{N}]|$)' }

Explanation

(?:(?<=[^\p{L}\p{N}])|^) Match either of the following
- (?<=[^\p{L}\p{N}]) Positive lookbehind ensuring what precedes is a character not in the set \p{L}\p{N} (\p{L} is any letter in any language and \p{N} is any number in any language)
- ^ Assert position at the start of the line
42 The characters 42 literally
(?=[^\p{L}\p{N}]|$) Positive lookahead ensuring either of the following matches
- [^\p{L}\p{N}] Match a character that is not present in the set \p{L}\p{N}
- $ Assert position at the end of the line
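As a quick check of the explanation above, here is a sketch that runs the same pattern over the sample inputs in JavaScript. It assumes a runtime that supports lookbehind and \p{...} with the u flag (for example a recent Node.js); MongoDB itself does not need that flag, as the filter above shows. The variable names are hypothetical.

```javascript
// Same pattern as above; in JavaScript the `u` flag is what enables \p{...}.
const wholeNumber = /(?:(?<=[^\p{L}\p{N}])|^)42(?=[^\p{L}\p{N}]|$)/u;

// Sample inputs from the question; '\u00e9' is a precomposed "é" (so "été42 ")
const samples = ['42', 'hello 42', 'hello-42-', '\u00e9t\u00e942 ', '042', ' 4 2'];

// Keep only the strings in which the pattern finds a match
const matched = samples.filter((s) => wholeNumber.test(s));

console.log(matched);
// prints [ '42', 'hello 42', 'hello-42-' ]
```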

Other options

As @Wiktor Stribiżew mentioned (in the comments under your question), there is another option if your engine supports the PCRE verb (*UCP). The UCP (Unicode Character Properties) option makes PCRE use Unicode properties for its character classes, which means that \d and \w are extended to match Unicode characters beyond [0-9] and [a-zA-Z0-9_], and \b follows the extended \w.

This would allow you to use the regex (*UCP)\b42\b as seen here

Mongo

Tested and validated with this mongo filter:
{ $regex : '(*UCP)\\b42\\b' }
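If the search term is not hard-coded, the filter can be built from a string. The sketch below is only an illustration: escapeRegex and ucpWholeWordFilter are hypothetical helper names, and the find call in the final comment assumes the collection and field from the question.

```javascript
// Hypothetical helper: escape regex metacharacters so the term is matched literally.
function escapeRegex(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build the $regex filter using the (*UCP) approach above.
function ucpWholeWordFilter(term) {
  return { $regex: '(*UCP)\\b' + escapeRegex(term) + '\\b' };
}

const filter = ucpWholeWordFilter('42');
console.log(filter.$regex); // prints (*UCP)\b42\b

// Assumed usage in the mongo shell:
// db.getCollection('collection_name').find({ title: filter });
```

Note that \b only behaves as a whole-word guard when the term itself starts and ends with a word character.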
