I'm using the PyPI module regex for regex matching. Its documentation says:

  • Default Unicode word boundary

    The WORD flag changes the definition of a ‘word boundary’ to that of a default Unicode word boundary. This applies to \b and \B.

But nothing seems to have changed:

>>> import regex
>>> r1 = regex.compile(r".\b.", flags=regex.UNICODE)
>>> r2 = regex.compile(r".\b.", flags=regex.UNICODE | regex.WORD)
>>> r1.findall("русский  ελλανικα")
['й ', ' ε']
>>> r2.findall("русский  ελλανικα")
['й ', ' ε']

I don't observe any difference. What does the flag actually change?

iBug
  • The way you can tell is to use a non-Unicode regex simulation `(?:(?:^|(?<=[^a-zA-Z0-9_]))(?=[a-zA-Z0-9_])|(?<=[a-zA-Z0-9_])(?:$|(?=[^a-zA-Z0-9_])))` which has no match... obviously ! –  Sep 20 '18 at 01:19
  • @sln no................... Python regex matches Unicode with `\w` correctly, and that flag only affects `\b`, as the docs says. I recommend you quit this argument. – iBug Sep 20 '18 at 01:22
  • Well, I guess WORD doesn't affect boundary correctly, unless you can prove it .. –  Sep 20 '18 at 01:25
  • For what it's worth, you can see the same behavior here https://regex101.com/r/0a0pfX/1 and note the default state is no flags other than global. I estimate it is using the re module, but there is a Unicode flag that does nothing, so it might be a holdover within the regex module so as not to disturb anything. –  Sep 20 '18 at 01:28
  • @sln regex101 isn't good for this. I specifically said I'm using a 3rd-party module instead of Python's stock `re`. There are differences. – iBug Sep 20 '18 at 01:32
  • You mean the _regex_ replacement module? I don't think that's 3rd party, it's pretty much the replacement for re. And there are more differences than you can digest. Even for me it's taxing. –  Sep 20 '18 at 01:33
  • @sln At least the `WORD` flag isn't present in the stock `re`, which regex101 runs. – iBug Sep 20 '18 at 01:35
  • Well, the point being regarding the re, (not to be nit-picking but) in the string you're using `русский ελλανικα` there are no _word_ characters, so there can be no _word boundary_ anywhere. Unicode flag or not, it still matches what yours matched. You gotta wonder about that. –  Sep 20 '18 at 02:24

1 Answer

The WORD flag changes how word boundaries are defined; that is the only difference between using it and not using it.

Given this example:

import regex

t = 'A number: 3.4 :)'

print(regex.search(r'\b3\b', t))
print(regex.search(r'\b3\b', t, flags=regex.WORD))

The first prints a match, while the second prints None. Why? The default Python word boundary is simply any position with a \w character on one side and a \W character (or the start or end of the string) on the other, where \w is still Unicode-aware. A “default Unicode word boundary”, by contrast, is defined by a set of rules that decide where words may and may not be split.
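
To make the default definition concrete, here is a minimal sketch that uses only the stock re module. The names EMULATED_B and boundary_positions are made up for this illustration; the lookaround pattern is an emulation of \b written out by hand, not something taken from either module's source.

```python
import re

# Hand-written emulation of the default \b: a position that has a word
# character on exactly one side (string start/end count as "not a word").
EMULATED_B = r"(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))"


def boundary_positions(pattern, text):
    """Return the offsets at which a zero-width pattern matches in text."""
    return [m.start() for m in re.finditer(pattern, text)]


t = 'A number: 3.4 :)'
builtin = boundary_positions(r"\b", t)
emulated = boundary_positions(EMULATED_B, t)

print(builtin)
print(builtin == emulated)
```

Offsets 11 and 12 (just before and just after the period) both appear in the list, which is why \b3\b finds the 3 when the WORD flag is not set.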

In the example, Python's default word boundary splits 3.4 because the period is a \W character, so there is a boundary between the 3 and the period. Under the Unicode rules, breaks are forbidden around a “.” that sits between digits (rules WB11 and WB12, which use “3.2” as their example), so there is no word boundary next to the period in 3.4.
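
The effect of that one rule can be approximated in plain Python. This is only a rough sketch under loud assumptions: MID_NUM is a hypothetical two-character stand-in for the much larger character classes in the Unicode rules, the helper names are invented, and this is not how the regex module implements the WORD flag.

```python
import re

# ASSUMPTION: a tiny, hypothetical subset of the separators that the
# Unicode rules allow inside a number.
MID_NUM = {".", ","}


def default_boundaries(text):
    """Offsets where the stock re module's \\b matches."""
    return [m.start() for m in re.finditer(r"\b", text)]


def numeric_aware_boundaries(text):
    """Drop default boundaries that touch a separator between two digits."""
    kept = []
    for pos in default_boundaries(text):
        # Separator right after pos: digit, [pos] separator, digit.
        sep_after = (
            0 < pos < len(text) - 1
            and text[pos] in MID_NUM
            and text[pos - 1].isdigit()
            and text[pos + 1].isdigit()
        )
        # Separator right before pos: digit, separator, [pos] digit.
        sep_before = (
            1 < pos < len(text)
            and text[pos - 1] in MID_NUM
            and text[pos - 2].isdigit()
            and text[pos].isdigit()
        )
        if not (sep_after or sep_before):
            kept.append(pos)
    return kept


t = 'A number: 3.4 :)'
print(default_boundaries(t))
print(numeric_aware_boundaries(t))
```

With the boundaries on either side of the period removed, there is no boundary directly after the 3, so a pattern like \b3\b has nothing to match. A period that merely ends a sentence (as in 'ends in 3.') is not between two digits, so its boundary is kept.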

See all the Unicode word boundary rules here: https://unicode.org/reports/tr29/#Word_Boundary_Rules

Conclusion:

Both modes work with Unicode or your LOCALE, but the WORD flag applies an additional set of rules for finding word boundaries, rather than only matching the empty string next to a \W character, which is the default because “a word is defined as a sequence of word characters [\w]”.

Taku
  • Are you sure a word boundary is defined there ? I mean it looks like a lot of `word_break` properties, not to be confused with the `\b` syntax. –  Sep 20 '18 at 02:25
  • So it is word break property. I can tell you imo, it is fairly impossible to implement that in the `\b` construct. In the non-Unicode implementation of `\b` done in C, it is really a string primitive without much overhead. In Unicode implementation, once the word is defined (and it's more than alnum property, it's all the minutiae that underscore represents), it is much more complicated. –  Sep 20 '18 at 02:32
  • Yeah, I see that. I can tell you that regex implementers will not try to implement this complexity via sentences at all. I can see that guy who did _regex_ trying it though. Have you seen some of the bizarre syntax he uses for regex, omg ... –  Sep 20 '18 at 02:36
  • Yeah, that’s probably why it’s not an option in the standard regex library in Python. – Taku Sep 20 '18 at 02:37