Brief
Word boundaries \b act oddly at times, especially when used with Unicode characters. This is due to the nature of the word character \w and how each flavour of regex interprets it. Word characters \w are usually defined as [a-zA-Z0-9_]. When you enable Unicode matching, some regex flavours include Unicode characters in the word character's set, whilst others do not.
Why all this talk about word characters? Because word boundaries \b depend on word characters \w. \b is an assertion that ensures (^\w|\w$|\W\w|\w\W) matches at that location.
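The dependence of \b on \w can be observed with Python's standard re module, used here only as a convenient stand-in for flavours that do or do not extend \w to Unicode. For str patterns, Python's \w is Unicode-aware by default, and the re.ASCII flag restricts it to [a-zA-Z0-9_]. A minimal sketch (the sample string is made up for illustration):

```python
import re

text = "été 42"

# Unicode-aware \w (Python's default for str patterns): "é" is a word
# character, so the only boundaries are around "été" and around "42".
unicode_words = re.findall(r"\b\w+\b", text)

# ASCII-only \w: "é" is a non-word character, so \b appears on both
# sides of the lone "t".
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

print(unicode_words)  # ['été', '42']
print(ascii_words)    # ['t', '42']
```

The pattern is identical in both calls. Only the definition of \w changes, and the boundaries move with it.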
To cite @Ωmega's answer on this post:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length. There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters". In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class \w. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\B is the negated version of \b. \B matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters.
\W is short for [^\w], the negated version of \w.
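The three boundary positions quoted above, and the complementary behaviour of \B, can be listed explicitly. The following is a small sketch using Python's re module. The sample string is an arbitrary illustration, not taken from the question.

```python
import re

text = "hi, yo"

# Offsets where \b succeeds: start of the string (first character is a word
# character), after "hi", before "yo", and the end of the string.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# Offsets where \B succeeds: between two word characters ("h|i", "y|o")
# and between two non-word characters (",| ").
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]

print(boundaries)      # [0, 2, 4, 6]
print(non_boundaries)  # [1, 3, 5]
```

Together the two lists cover every offset from 0 to len(text) exactly once, which is what "\B matches at every position where \b does not" means in practice.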
Code
See this regex in use here
(?:(?<=[^\p{L}\p{N}])|^)42(?=[^\p{L}\p{N}]|$)
Results
Input
42
hello 42
hello-42-
été42
042
4 2
Output
Note: Below are the strings where a match occurred.
42
hello 42
hello-42-
Mongo
Tested and validated with this MongoDB filter:
{ $regex : '(?:(?<=[^\\p{L}\\p{N}])|^)42(?=[^\\p{L}\\p{N}]|$)' }
Explanation
- (?:(?<=[^\p{L}\p{N}])|^) Match either of the following:
  - (?<=[^\p{L}\p{N}]) Positive lookbehind ensuring that what precedes is a character not in the set [\p{L}\p{N}] (\p{L} is any letter in any language and \p{N} is any number in any language)
  - ^ Assert position at the start of the line
- 42 The characters 42 literally
- (?=[^\p{L}\p{N}]|$) Positive lookahead ensuring either of the following matches:
  - [^\p{L}\p{N}] Match a character that is not present in the set [\p{L}\p{N}]
  - $ Assert position at the end of the line
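The same structure can be tried out in Python, with one substitution that should be read as an assumption rather than an exact equivalent: the standard re module has no \p{...} support, so the sketch below uses [\W_] (any non-word character, plus the underscore) as an approximation of [^\p{L}\p{N}]. The inputs are the ones listed under Results.

```python
import re

# Assumption: [\W_] approximates [^\p{L}\p{N}], because Python's re module
# cannot express Unicode property classes such as \p{L} or \p{N}.
pattern = re.compile(
    r"""
    (?: (?<=[\W_]) | ^ )   # preceded by a non-letter/non-digit, or at the start
    42                     # the literal characters 42
    (?= [\W_] | $ )        # followed by a non-letter/non-digit, or at the end
    """,
    re.VERBOSE,
)

inputs = ["42", "hello 42", "hello-42-", "été42", "042", "4 2"]
matched = [s for s in inputs if pattern.search(s)]

print(matched)  # ['42', 'hello 42', 'hello-42-']
```

In "été42" the character before 42 is a letter, and in "042" it is a digit, so the lookbehind fails and the start-of-string alternative does not apply. "4 2" never contains the literal 42.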
Other options
As @Wiktor Stribiżew mentioned (in the comments under your question), there may be another option if you can use PCRE regex: (*UCP). The pattern modifier UCP (Unicode Character Properties) makes PCRE classify characters by their Unicode properties, which means that \d and \w are extended to match Unicode characters other than [0-9] and [a-zA-Z0-9_], and \b follows the extended \w.
This would allow you to use the regex (*UCP)\b42\b as seen here
Mongo
Tested and validated with this MongoDB filter:
{ $regex : '(*UCP)\\b42\\b' }
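Python's re module does not accept (*UCP), but its default handling of str patterns is a rough analogue: \w and \b are Unicode-aware unless re.ASCII is passed. The sketch below relies on that analogy, which is an assumption and not a PCRE test, to compare \b42\b in both modes on the inputs from the Results section.

```python
import re

inputs = ["42", "hello 42", "hello-42-", "été42", "042", "4 2"]

# Unicode-aware \b (Python's default for str patterns): "é" is a word
# character, so there is no boundary between "é" and "4".
unicode_hits = [s for s in inputs if re.search(r"\b42\b", s)]

# ASCII-only \b: "é" is a non-word character, so "été42" now matches too.
ascii_hits = [s for s in inputs if re.search(r"\b42\b", s, flags=re.ASCII)]

# The underscore is a word character in both modes, so "_42" is rejected.
underscore_hit = bool(re.search(r"\b42\b", "_42"))

print(unicode_hits)    # ['42', 'hello 42', 'hello-42-']
print(ascii_hits)      # ['42', 'hello 42', 'hello-42-', 'été42']
print(underscore_hit)  # False
```

One difference from the lookaround pattern is worth keeping in mind: the set [\p{L}\p{N}] does not contain the underscore, so that pattern would accept "_42", whereas any \b-based pattern treats the underscore as part of the word.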