Unexpected behavior of regex word boundaries with Unicode strings

Question

Can someone please explain this behavior of regex:

When I replace the last two characters of a Unicode string with another Unicode character, it works fine with the end-of-string anchor ($), but gives unexpected results if I put the $ in square brackets ([$]).

Also, the word boundary \b gives unexpected results and, surprisingly, \B matches what \b is supposed to match.

>>> import re
>>> line = u'\u0627\u062f\u0646\u06cc\u0670'

>>> re.sub(ur'\u06cc\u0670$', ur'\u0627', line) #works fine
u'\u0627\u062f\u0646\u0627' 

>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line) #unexpected result
u'\u0627\u062f\u0646\u06cc\u0670' 

>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line, re.U) #still not working
u'\u0627\u062f\u0646\u06cc\u0670'

>>> re.sub(ur'\u06cc\u0670\b', ur'\u0627', line, re.U) #unexpected 
u'\u0627\u062f\u0646\u06cc\u0670'

>>> re.sub(ur'\u06cc\u0670\B', ur'\u0627', line, re.U) #unexpected
u'\u0627\u062f\u0646\u0627'

score 1 · Accepted Answer · edited May 23 '17 at 11:52

The signature of re.sub is:

sub(pattern, repl, string, count=0, flags=0)

In your calls, re.U is passed positionally as the fourth argument, which is count, so it is never used as a flag. Pass it as a keyword argument instead:

re.sub(ur'\u06cc\u0670\b', u'\u0627', line, flags=re.U)
#                                           ^~~~~~
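
To make the count/flags mix-up concrete, here is a minimal sketch in Python 3 syntax (plain string literals instead of u'' and ur''). The sample text and variable names are made up for illustration. It shows the integer value of re.U acting as a replacement limit when it lands in count:

```python
import re

text = "a" * 40  # hypothetical sample input

# The fourth positional parameter of re.sub is count, so
# re.sub("a", "b", text, re.U) behaves like this keyword call:
limited = re.sub("a", "b", text, count=int(re.U))

# Passing the flag by keyword leaves count at its default of 0 (replace all).
unlimited = re.sub("a", "b", text, flags=re.U)

print(limited.count("b") == int(re.U))    # the flag's value acted as the limit
print(unlimited.count("b") == len(text))  # every occurrence was replaced
```

With only one possible match, as in the question, the limit never becomes visible, which is why the call appeared to work while the flag was ignored.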

[…] defines a character class, and $ is not special inside the brackets. So [$] will just match a literal dollar sign.
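
A small ASCII sketch of the difference, in Python 3 syntax; the sample string is an assumption chosen only to keep the output readable:

```python
import re

word = "cab"  # hypothetical sample input

anchored = re.sub(r"ab$", "X", word)             # $ is the end-of-string anchor
in_class = re.sub(r"ab[$]", "X", word)           # [$] needs a real "$" character
with_dollar = re.sub(r"ab[$]", "X", word + "$")  # now there is one to match

print(anchored, in_class, with_dollar)  # cX cab cX
```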
\b matches at the boundary between a word character (\w) and a non-word character (\W, or the start/end of the string), and \B matches at every position that is not a \b. Now, \u0670 is a non-word character in Unicode:
```
>>> re.findall(ur'\w', line, flags=re.U)
[u'\u0627', u'\u062f', u'\u0646', u'\u06cc']
>>> re.findall(ur'\W', line, flags=re.U)
[u'\u0670']
```
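This means the end of the string after \u0670 is not a word boundary: \u0670 is a non-word character, and the end of the string also counts as non-word, so there is no word/non-word transition at that position. \b therefore cannot match there, and \B does.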
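
The same reasoning can be checked with a short sketch in Python 3 syntax, where str patterns are Unicode-aware without an explicit flag. The ASCII strings are an assumed analogue added for illustration: "!" plays the role of \u0670.

```python
import re

line = "\u0627\u062f\u0646\u06cc\u0670"

# U+0670 is a non-word character and so is the end of the string,
# so there is no \b after it; \B matches at that position instead.
with_b = re.sub(r"\u06cc\u0670\b", "\u0627", line)
with_big_b = re.sub(r"\u06cc\u0670\B", "\u0627", line)

# ASCII analogue: "!" is a non-word character at the end of the string.
ascii_b = re.search(r"!\b", "ab!")
ascii_big_b = re.search(r"!\B", "ab!")

print(with_b == line)           # True: \b found nothing to replace
print(ascii(with_big_b))        # the last two characters were replaced
print(ascii_b is None, ascii_big_b is not None)
```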

With re.U set, \w means "[0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database".

A character like U+06CC (Arabic Letter Farsi Yeh) is categorized as Letter, Other (Lo), so it is a word character, but U+0670 (Arabic Letter Superscript Alef) is categorized as Mark, Nonspacing (Mn), so it is not considered a word character.

(For details of Python's regex syntax, see https://docs.python.org/2/library/re.html)
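
The categories can be looked up with the standard unicodedata module. A minimal sketch in Python 3 syntax; the dictionary name is just an illustrative choice:

```python
import unicodedata

chars = ("\u06cc", "\u0670")

# Map each character to its Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in chars}

for ch in chars:
    # Print the code point, its category, and its official character name.
    print("U+%04X" % ord(ch), categories[ch], unicodedata.name(ch))
```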

As for the comment below, you can use a negative look-ahead instead of a group:

re.sub(ur'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line, flags=re.U)

Here,

[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2] is the same as your \u06cc\u0670|\u06d2\u0670|\u0670\u06cc|\u0670\u06d2, but with similar cases grouped together.
(?:…) defines a non-capturing group, so the "\b" you wanted can be factored out of the alternation and written once.
(?!\w) is a negative look-ahead: the match succeeds only if the next character is not a word character, which includes the case where there is no next character.

The result is like:

>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line, flags=re.U)
u'\u0627\u062f\u0646\u0627'
>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line + u'\u0646', flags=re.U)
u'\u0627\u062f\u0646\u06cc\u0670\u0646'
>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line + u'\u061f', flags=re.U)
u'\u0627\u062f\u0646\u0627\u061f'
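
If you would rather keep the (\s) approach from the comment, the same factoring answers the group-number question: with a non-capturing alternation followed by a single capturing group, the whitespace is always group 1. A minimal sketch in Python 3 syntax; the function name and sample string are hypothetical:

```python
import re

# Non-capturing alternation, then exactly one capturing group for the whitespace.
PATTERN = re.compile(r"(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(\s)")


def replace_before_space(text):
    """Replace the two-character sequence and keep the whitespace that follows."""
    # "\\1" re-inserts the captured whitespace after the replacement character.
    return PATTERN.sub("\u0627\\1", text)


sample = "\u0627\u062f\u0646\u06cc\u0670 \u0646"  # hypothetical input with a space
result = replace_before_space(sample)
print(ascii(result))
```

Unlike the look-ahead version, this requires an actual whitespace character after the sequence, so it does not match at the very end of the string.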

Actually I want something like `re.sub(u'\u06cc\u0670\b|\u06d2\u0670\b|\u0670\u06cc\b|\u0670\u06d2\b', u'\u0627', line, re.U)`, but since `\b` won't work in this case, I tried `re.sub(u'\u06cc\u0670(\s)|\u06d2\u0670(\s)|\u0670\u06cc(\s)|\u0670\u06d2(\s)', ur'\u0627\1', line, re.U)`. Now I don't know which group number to use in the replacement: `\1`, `\2`, or ... . Can you please help me with this? - Irshad Bhat, Dec 25 '15 at 18:10
