0
>>> re.sub('\w', '\1', 'абвгдеёжз')
'\x01\x01\x01\x01\x01\x01\x01\x01\x01'

Why does re.sub return data in this format? I want it to return the unaltered string 'абвгдеёжз' in this case. Changing the string to u'абвгдеёжз' or passing flags=re.U doesn't do anything.

ShadowRanger
user10687617
    Note: You should *always* be using raw strings for regex patterns and almost always for replacement patterns; you got lucky with `'\w'` (because Python is "nice" and leaves the backslash in place because `\w` isn't a recognized escape code for strings), but other regex escapes overlap string escapes; `'\b'` is an ASCII backspace, `r'\b'` is an actual backslash followed by `b` that `re` will interpret as a word boundary assertion as most people expect. – ShadowRanger Sep 26 '19 at 19:57
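To illustrate the comment above, here is a minimal sketch of how `'\b'` and `r'\b'` differ once they reach `re`. The sample sentence and variable names are made up for the demonstration.

```python
import re

# '\b' in a normal string literal is a single backspace character (code point 8).
assert '\b' == '\x08'
# r'\b' is two characters: a backslash followed by 'b'.
assert r'\b' == '\\b'

sentence = 'cat concat cat'  # hypothetical sample text

# Raw pattern: re sees \b and treats it as a word boundary assertion.
raw_result = re.sub(r'\bcat\b', 'dog', sentence)

# Non-raw pattern: re sees literal backspace characters, which do not occur
# in the sentence, so nothing is replaced.
plain_result = re.sub('\bcat\b', 'dog', sentence)

print(raw_result)
print(plain_result)
```

Only the raw-string pattern replaces the standalone words; the non-raw one silently matches nothing.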

2 Answers

4

Because '\1' is the character with code point 1 (its repr is '\x01'). Under the rules for string literals, '\1' is an octal escape, so re.sub never saw your backslash. Even if you had escaped it, as in r'\1' or '\\1', reference 1 isn't the right number: your pattern defines no group 1, and you need parentheses to define groups. r'\g<0>', which refers to the entire match, would work, as described in the re.sub documentation.
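A short sketch of the points above, assuming Python 3 (where `\w` matches Unicode word characters for `str` patterns by default). The variable names are only illustrative.

```python
import re

text = 'абвгдеёжз'

# '\1' in a normal string literal is an octal escape: one character, code point 1.
assert '\1' == '\x01' and len('\1') == 1
# A raw string keeps the backslash, so re.sub receives two characters.
assert len(r'\1') == 2

# r'\g<0>' refers to the whole match, so no capturing group is needed.
whole_match = re.sub(r'\w', r'\g<0>', text)

# r'\1' works once the pattern defines group 1 with parentheses.
group_one = re.sub(r'(\w)', r'\1', text)

# Without a group in the pattern, r'\1' is rejected as an invalid group reference.
try:
    re.sub(r'\w', r'\1', text)
    missing_group_error = None
except re.error as exc:
    missing_group_error = exc

print(whole_match)
print(group_one)
print(missing_group_error)
```

Both successful calls return the input unchanged, while the call with `r'\1'` and no group raises `re.error`.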

Yann Vernier
0

Perhaps you meant to:

>>> re.sub(r'(\w)', r'\1', 'абвгдеёжз')
'абвгдеёжз'
user2468968