Python's re.sub returns data in wrong encoding from unicode

Question

>>> re.sub('\w', '\1', 'абвгдеёжз')
'\x01\x01\x01\x01\x01\x01\x01\x01\x01'

Why does re.sub return data in this format? I want it to return the unaltered string 'абвгдеёжз' in this case. Changing the string to u'абвгдеёжз' or passing flags=re.U doesn't do anything.

Note: You should *always* be using raw strings for regex patterns and almost always for replacement patterns; you got lucky with `'\w'` (because Python is "nice" and leaves the backslash in place because `\w` isn't a recognized escape code for strings), but other regex escapes overlap string escapes; `'\b'` is an ASCII backspace, `r'\b'` is an actual backslash followed by `b` that `re` will interpret as a word boundary assertion as most people expect. — ShadowRanger, Sep 26 '19 at 19:57
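To make the comment's point concrete, here is a minimal sketch, assuming Python 3, of how `'\b'` and `r'\b'` differ once they reach `re`. The sample string and the variable names `text`, `non_raw`, and `raw` are made up for illustration.

```python
import re

text = "foo bar"  # hypothetical sample input

# Without the r prefix, '\b' is a single character: the ASCII backspace.
assert '\b' == '\x08' and len('\b') == 1
# With the r prefix, the literal keeps both characters: a backslash and 'b'.
assert len(r'\b') == 2

# The non-raw pattern looks for literal backspace characters, so nothing matches.
non_raw = re.sub('\bfoo\b', 'X', text)
# The raw pattern reaches re intact and is read as word-boundary assertions.
raw = re.sub(r'\bfoo\b', 'X', text)

print(repr(non_raw))  # 'foo bar'
print(repr(raw))      # 'X bar'
```

The non-raw version fails silently: no error is raised, the pattern simply never matches.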

score 4 · Accepted Answer · answered Sep 26 '19 at 19:52

Because '\1' is the character with code point 1 (its repr is '\x01'). Per the rules for string literals, Python turns the octal escape \1 into that character before re.sub is called, so re.sub never sees your backslash. Even if you preserve the backslash, as in r'\1' or '\\1', group 1 does not exist: the pattern contains no parentheses, so it defines no capturing groups. r'\g<0>', which refers to the entire match, would work, as described in the re.sub documentation.
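The points in this answer can be checked directly. The following is a small sketch, assuming Python 3 (where `str` patterns match Unicode word characters by default). The variable names are illustrative only.

```python
import re

text = 'абвгдеёжз'

# In a normal string literal, '\1' is an octal escape: one character, code point 1.
assert '\1' == '\x01' and len('\1') == 1
# A raw literal keeps the backslash, so it is two characters.
assert len(r'\1') == 2

# The replacement is the single character U+0001, so every letter becomes '\x01'.
broken = re.sub(r'\w', '\1', text)

# \g<0> refers to the whole match and needs no parentheses in the pattern.
whole = re.sub(r'\w', r'\g<0>', text)

# With an explicit capturing group, r'\1' refers to group 1.
grouped = re.sub(r'(\w)', r'\1', text)

print(repr(broken))
print(whole)
print(grouped)
```

Both `whole` and `grouped` reproduce the input unchanged, which is the behaviour the question asked for.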

score 0 · Answer 2 · answered Sep 26 '19 at 20:39

Perhaps you meant to:

>>> re.sub(r'(\w)', r'\1', 'абвгдеёжз')
'абвгдеёжз'

answered by user2468968
