
Can someone please explain this behavior of regex:

When I replace the last two characters of a Unicode string with another Unicode character, it works fine with the end-of-line anchor ($) at the end of the pattern, but gives unexpected results if I put the $ in square brackets, [$].

Also, the word boundary \b gives unexpected results and, surprisingly, \B matches what \b is supposed to match.

>>> line = u'\u0627\u062f\u0646\u06cc\u0670'

>>> re.sub(ur'\u06cc\u0670$', ur'\u0627', line) #works fine
u'\u0627\u062f\u0646\u0627' 

>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line) #unexpected result
u'\u0627\u062f\u0646\u06cc\u0670' 

>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line, re.U) #still not working
u'\u0627\u062f\u0646\u06cc\u0670'

>>> re.sub(ur'\u06cc\u0670\b', ur'\u0627', line, re.U) #unexpected 
u'\u0627\u062f\u0646\u06cc\u0670'

>>> re.sub(ur'\u06cc\u0670\B', ur'\u0627', line, re.U) #unexpected
u'\u0627\u062f\u0646\u0627'
Irshad Bhat

1 Answer

  1. The signature of re.sub is:

    sub(pattern, repl, string, count=0, flags=0)
    

    Here re.U is passed positionally, so it lands in count (re.U has the integer value 32) and limits the number of replacements instead of setting a flag. Pass it as a keyword argument instead:

    re.sub(ur'\u06cc\u0670\b', u'\u0627', line, flags=re.U)
    #                                           ^~~~~~ 
    
  2. […] defines a character class, and $ is not special inside the brackets. So [$] will just match a literal dollar sign.

  3. \b matches the boundary between a word character ("\w") and a non-word character ("\W", or the start/end of the string), and \B matches anywhere that \b does not. Now, \u0670 is a non-word character in Unicode:

    >>> re.findall(ur'\w', line, flags=re.U)
    [u'\u0627', u'\u062f', u'\u0646', u'\u06cc']
    >>> re.findall(ur'\W', line, flags=re.U)
    [u'\u0670']
    

    This means the end of the string after \u0670 is not a word boundary, because \u0670 is not a word character: there is a non-word character on one side and the end of the string on the other. So \b cannot match there, which means \B does.

    The meaning of \w in Unicode is "[0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database".

    A character like U+06CC (Arabic Letter Farsi Yeh) is categorized as Letter, Other (Lo), so it is a word character, but U+0670 (Arabic Letter Superscript Alef) is categorized as Mark, Nonspacing (Mn), so it is not.
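
The categories mentioned in point 3 can be checked directly with the standard unicodedata module. This is a minimal sketch, not part of the original answer: the helper name describe is made up for illustration, and the __future__ import is only there so that the same snippet should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function, unicode_literals

import re
import unicodedata

line = '\u0627\u062f\u0646\u06cc\u0670'


def describe(ch):
    """Return (code point, general category, whether \\w matches) for one character."""
    code_point = 'U+%04X' % ord(ch)
    category = unicodedata.category(ch)
    is_word_char = bool(re.match(r'\w', ch, flags=re.U))
    return (code_point, category, is_word_char)


for ch in line:
    # The first four characters report category Lo and match \w;
    # the last one (U+0670) reports Mn and does not.
    print('%s %s %s' % describe(ch))
```

Any character whose category is a mark (Mn, Mc, Me) behaves the same way as U+0670 here, which is why \b is unreliable right after a combining mark.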

(For details of Python's regex syntax, see https://docs.python.org/2/library/re.html)
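
Points 1 and 2 can also be verified in isolation. The sketch below is an illustration added for this purpose, with made-up variable names; it passes count by keyword to show what the value of re.U means in that position, rather than repeating the positional call from the question.

```python
from __future__ import unicode_literals

import re

line = '\u0627\u062f\u0646\u06cc\u0670'
tail = '\u06cc\u0670'   # the two characters to replace
alef = '\u0627'         # the replacement character

# Point 1: re.U is just the integer 32, so as a fourth positional argument
# it is interpreted as count=32, i.e. "replace at most 32 matches".
flag_value = int(re.U)
replaced = re.sub('a', 'b', 'a' * 40, count=flag_value)
replacements_made = replaced.count('b')   # 32 of the 40 matches are replaced

# Point 2: [$] is a character class containing a literal dollar sign,
# so the pattern only matches when an actual "$" follows the two characters.
no_dollar = re.sub(tail + r'[$]', alef, line)          # no match, unchanged
with_dollar = re.sub(tail + r'[$]', alef, line + '$')  # matches and consumes "$"
```

Note that with_dollar no longer contains the dollar sign, because the character class consumes it as part of the match, whereas the $ anchor consumes nothing.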


As for the comment below, you can use a negative look-ahead instead of a group:

re.sub(ur'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line, flags=re.U)

Here,

  • [\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2] is the same as your \u06cc\u0670|\u06d2\u0670|\u0670\u06cc|\u0670\u06d2, but with similar cases grouped together
  • (?:…) defines a non-capturing group, so that the "\b" you want can be factored out of the alternation
  • (?!\w) means we match only if the next character is not a word character (the end of the string also qualifies).

The result looks like this:

>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line, flags=re.U)
u'\u0627\u062f\u0646\u0627'
>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line + u'\u0646', flags=re.U)
u'\u0627\u062f\u0646\u06cc\u0670\u0646'
>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line + u'\u061f', flags=re.U)
u'\u0627\u062f\u0646\u0627\u061f'
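
If the substitution is applied to many strings, the same pattern can be compiled once and wrapped in a function. This is only a sketch under that assumption: the names TRAILING_YEH_ALEF and normalize are hypothetical, and the __future__ import is there so the snippet should behave the same on Python 2.7 and Python 3.

```python
from __future__ import unicode_literals

import re

# Same pattern as above: either order of (yeh or yeh barree) and superscript
# alef, not followed by a word character. The look-ahead consumes nothing,
# so no group reference is needed in the replacement.
TRAILING_YEH_ALEF = re.compile(
    '(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])' + r'(?!\w)',
    flags=re.U,
)


def normalize(text):
    """Replace each matching two-character ending with U+0627."""
    return TRAILING_YEH_ALEF.sub('\u0627', text)


line = '\u0627\u062f\u0646\u06cc\u0670'
at_end = normalize(line)                     # replaced at end of string
before_letter = normalize(line + '\u0646')   # unchanged, a letter follows
two_words = normalize(line + ' ' + line)     # replaced before the space and at the end
```

Because whitespace and punctuation are not word characters, the look-ahead covers the `(\s)` cases from the comment as well as the end of the string, without having to track which capture group matched.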
kennytm
  • Actually I want something like `re.sub(u'\u06cc\u0670\b|\u06d2\u0670\b|\u0670\u06cc\b|\u0670\u06d2\b', u'\u0627', line, re.U)`, but since `\b` won't work in this case, I tried `re.sub(u'\u06cc\u0670(\s)|\u06d2\u0670(\s)|\u0670\u06cc(\s)|\u0670\u06d2(\s)', ur'\u0627\1', line, re.U)`. Now I don't know which group number to use in the replacement: `\1`, `\2`, or ... Can you please help me with this? – Irshad Bhat Dec 25 '15 at 18:10