Can someone please explain this regex behavior?
When I replace the last two characters of a Unicode string with another Unicode character, it works fine with the end-of-line anchor ($) at the end of the pattern, but gives unexpected results if I put the $ in square brackets ([$]).
Also, the word boundary \b gives unexpected results and, surprisingly, \B matches what \b is supposed to match.
>>> import re
>>> line = u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670$', ur'\u0627', line) #works fine
u'\u0627\u062f\u0646\u0627'
>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line) #unexpected result
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line, re.U) #still not working
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670\b', ur'\u0627', line, re.U) #unexpected
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670\B', ur'\u0627', line, re.U) #unexpected
u'\u0627\u062f\u0646\u0627'
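
For anyone reproducing this, here is a minimal sketch of the same checks, assuming Python 3 (the session above is Python 2, where the ur'' prefix is valid). The variable names are illustrative only. It relies on three points from the standard re and unicodedata modules: a $ inside a character class is a literal dollar sign, the fourth positional parameter of re.sub is count rather than flags, and U+0670 is a nonspacing mark (category Mn), which \w does not match.

```python
import re
import unicodedata

line = '\u0627\u062f\u0646\u06cc\u0670'
tail = '\u06cc\u0670'  # the two characters being replaced
repl = '\u0627'

# Outside a character class, $ is the end-of-string anchor.
anchored = re.sub(tail + '$', repl, line)

# Inside [...], $ is a literal dollar sign, so nothing matches here...
in_class = re.sub(tail + '[$]', repl, line)
# ...unless the text really ends with a '$' character.
in_class_literal = re.sub(tail + '[$]', repl, line + '$')

# re.sub's fourth positional parameter is count, so flags go by keyword.
with_flags = re.sub(tail + '$', repl, line, flags=re.UNICODE)

# U+0670 is a nonspacing mark; \w does not match it, so there is
# no word boundary between it and the end of the string.
category = unicodedata.category('\u0670')
with_b = re.sub(tail + r'\b', repl, line)
with_B = re.sub(tail + r'\B', repl, line)

for name, value in [('anchored', anchored), ('in_class', in_class),
                    ('in_class_literal', in_class_literal),
                    ('with_flags', with_flags), ('category', category),
                    ('with_b', with_b), ('with_B', with_B)]:
    print(name, ascii(value))
```

If this sketch is right, the [$] result comes from the literal dollar sign, and the \b / \B results come from the category of the last character rather than from the Unicode flag.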