The difference between using and omitting the WORD flag lies in how word boundaries are defined.
Consider this example:
import regex
t = 'A number: 3.4 :)'
print(regex.search(r'\b3\b', t))
print(regex.search(r'\b3\b', t, flags=regex.WORD))
The first call prints a match, while the second prints None. Why? With the WORD flag, \b follows the Unicode default word boundary rules, a set of context-sensitive rules for deciding where words begin and end. Without the flag, \b matches wherever a \w character is adjacent to a \W character (or to the start or end of the string), where \w still covers Unicode alphanumerics plus the underscore.
In the example, Python's default word boundary splits 3.4 because the period is a \W character, so there is a boundary between the 3 and the period, and \b3\b matches. Under the Unicode word boundary rules, WB11 and WB12 forbid a break inside numeric sequences such as “3.2” or “3,456.789”, so the positions around the period in 3.4 are not word boundaries and the pattern does not match.
See all the Unicode word boundary rules here: https://unicode.org/reports/tr29/#Word_Boundary_Rules
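To make the difference concrete, here is a rough sketch that uses only the standard re module. It lists the positions where the default \b matches, then filters out the ones next to a '.' or ',' sitting between two digits. The helper names default_boundaries and numeric_aware_boundaries are made up for this illustration, and the filter only imitates WB11/WB12. It is not the full Unicode algorithm and not how the regex module implements the WORD flag.

```python
import re


def default_boundaries(text):
    # Positions where the default \b matches: every transition between
    # a \w character and a \W character (or the string start/end).
    return [m.start() for m in re.finditer(r'\b', text)]


def numeric_aware_boundaries(text):
    # Hypothetical helper: a simplified imitation of rules WB11/WB12 only.
    # It drops the default boundaries on either side of a '.' or ','
    # that sits directly between two digits.
    kept = []
    for pos in default_boundaries(text):
        # Boundary just before the separator: digit | separator digit
        before_sep = (
            0 < pos < len(text) - 1
            and text[pos - 1].isdigit()
            and text[pos] in '.,'
            and text[pos + 1].isdigit()
        )
        # Boundary just after the separator: digit separator | digit
        after_sep = (
            1 < pos < len(text)
            and text[pos - 2].isdigit()
            and text[pos - 1] in '.,'
            and text[pos].isdigit()
        )
        if not (before_sep or after_sep):
            kept.append(pos)
    return kept


t = 'A number: 3.4 :)'
print(default_boundaries(t))        # includes 11 and 12, around the period
print(numeric_aware_boundaries(t))  # 11 and 12 are gone
```

With the default rule there is a boundary right after the 3 (position 11), which is why \b3\b matches. Once the boundaries around the period are suppressed, nothing separates 3 from .4, so \b3\b has no place to end.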
Conclusion:
Both modes work with Unicode (or with your locale when the LOCALE flag is set), but the WORD flag applies the additional Unicode rules for finding word boundaries, whereas the default \b only matches the empty string at the edge of a run of \w characters, since “a word is defined as a sequence of word character [\w]”.