Python regex string escaping for re.sub replace argument?

Question

Using the re module, the search pattern can be escaped with re.escape so that it matches literally, e.g.:

def my_replace(string, src, dst):
    import re
    return re.sub(re.escape(src), dst, string)

While this works for the most part, the dst string may include "\\9", for example.

This causes an issue:

\\1, \\2, etc. in dst are interpreted as group references instead of literals.
Using re.escape(dst) avoids that, but also causes . to be changed to \.

Is there a way to escape the destination without introducing redundant character escaping?

Example usage:

>>> my_replace("My Foo", "Foo", "Bar")
'My Bar'

So far, so good.

>>> my_replace("My Foo", "Foo", "Bar\\Baz")
...
re.error: bad escape \B at position 3

This tries to interpret \B as having a special meaning.

>>> my_replace("My Foo", "Foo", re.escape("Bar\\Baz"))
'My Bar\\Baz'

Works!

>>> my_replace("My Foo", "Foo", re.escape("Bar\\Baz."))
'My Bar\\Baz\\.'

The . gets escaped when we don't want that.

While str.replace can be used in this case, the question about the destination string remains useful, since there may be times we want other features of re.sub, such as the ability to ignore case.

I'm not sure I understand the issue - could you give an example string, src, dst which demonstrates it? — wim, Oct 09 '19 at 03:39
Looks like what you really want is `src.replace(r'\', r'\\')` as you don't seem to want `.` be replaced. — metatoaster, Oct 09 '19 at 03:51
@metatoaster Do you mean `dst`? If this avoids all possible interpretations, then yes. — ideasman42, Oct 09 '19 at 03:55
@ideasman42 yes. If you only want just this character this would be a way. If you want multiple modifications from this subset, using [`str.translate`](https://docs.python.org/3/library/stdtypes.html#str.translate) may be more desirable. Best approach is to create a number of test cases (add them to your unit test module) to formalise the problem you are trying to solve. — metatoaster, Oct 09 '19 at 04:04
@ideasman42 Did you get a solution to this without replacing the dst variable? In my case the capture groups are being treated as literals without the re.escape() — Sourav Kanta, May 04 '20 at 13:57
@metatoaster Your code does not work. Raw strings in Python cannot contain single backslash as the last character. The change of the line in the original function would be: `return re.sub(re.escape(src), dst.replace('\\', r'\\'), string)` — pabouk - Ukraine stay strong, May 20 '22 at 08:01
@pabouk-Ukrainestaystrong fair, though the demonstration of using `r'\'` was more an illustrative purpose. — metatoaster, May 21 '22 at 00:24

ideasman42 · Answer 1 · 2022-05-20T11:31:22.920

score 5

In this case only the backslash is interpreted as a special character, so instead of re.escape you can use a simple replacement on the destination argument.

def my_replace(string, src, dst):
    import re
    return re.sub(re.escape(src), dst.replace("\\", "\\\\"), string)
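A minimal sketch of the approach above, extended with a hypothetical flags parameter (not part of the original function) to cover the case-insensitive use mentioned in the question. The second function is an alternative that is not from this answer: when the replacement passed to re.sub is a callable, its return value is inserted as-is, so dst needs no escaping at all.

```python
import re


def my_replace(string, src, dst, flags=0):
    # `flags` is an assumed extra parameter, forwarded to re.sub.
    # Doubling each backslash makes re.sub treat dst as literal text.
    return re.sub(re.escape(src), dst.replace("\\", "\\\\"), string, flags=flags)


def my_replace_callable(string, src, dst, flags=0):
    # Alternative: a callable replacement is never parsed as a template.
    return re.sub(re.escape(src), lambda match: dst, string, flags=flags)


print(my_replace("My Foo", "Foo", "Bar\\Baz."))  # My Bar\Baz.
print(my_replace("My Foo", "Foo", "\\1"))  # My \1
print(my_replace("My foo", "FOO", "Bar", flags=re.IGNORECASE))  # My Bar
print(my_replace_callable("My Foo", "Foo", "Bar\\9."))  # My Bar\9.
```

Both variants leave the . untouched, which is the behaviour the question asks for.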

edited May 20 '22 at 11:31

answered Oct 09 '19 at 05:27

ideasman42


Raw strings in Python cannot contain single backslash as the last character. The modified argument would be: `dst.replace("\\", r"\\")` or maybe less confusingly without combining normal and raw strings: `dst.replace("\\", "\\\\")` – pabouk - Ukraine stay strong May 20 '22 at 08:07
`r"\\" == "\\\\"` is true here for Python 3.10. – ideasman42 May 20 '22 at 10:17
That just supports the second variant in my comment and it should be true in all supported versions. --- I was notifying you about something completely different: *You cannot have a **single (precisely: unpaired)** backslash as the last character of a raw string.* (Paired are fine.) This fails also in Python 3.10 (which started to take advantage of the new PEG parser) --- `>>> sys.version` `'3.10.4 (main, Apr 2 2022, 09:04:19) [GCC 11.2.0]'` `>>> r"\"` ... `SyntaxError: unterminated string literal (detected at line 1)` – pabouk - Ukraine stay strong May 20 '22 at 10:49
Good explanation: [Why can't Python's raw string literals end with a single backslash?](https://stackoverflow.com/a/19654184/320437) – pabouk - Ukraine stay strong May 20 '22 at 11:03
Ah `r"\"` does indeed fail, thanks - updated answer. – ideasman42 May 20 '22 at 11:31

Emma · Answer 2 · 2019-10-09T04:08:05.843

Your code works fine if you just remove the re.escape; I'm not sure why it is there:

Test 1

import re 

def my_replace(src, dst, string):
    return re.sub(src, dst, string)


string = 'abbbbbb'
src = r'(ab)b+'
dst = r'\1z'

print(my_replace(src, dst, string))

Output 1

abz

Test 2

import re


def my_replace(src, dst, string):
    return re.sub(src, dst, string)


string = re.escape("abbbbbbBar\\Baz")
src = r'(ab)b+'
dst = r'\1z'

print(my_replace(src, dst, string))

Output 2

abzBar\\Baz

Test 3

import re


def my_replace(src, dst, string):
    return re.sub(src, dst, string)


string = re.escape("abbbbbbBar\\Baz")
src = r'(ab)b+'
dst = r'\1' + re.escape('\\z')

print(my_replace(src, dst, string))

Output 3

ab\zBar\\Baz

Test 4

To construct dst, we first have to know whether the replacement uses any capturing groups, such as \1 in this case. We cannot pass \1 through re.escape, because it would become \\1 and be inserted literally. Instead, build the replacement from the group references, then append any other part that needs re.escape.

import re


def my_replace(src, dst, string):
    return re.sub(src, dst, string)


string = re.escape("abbbbbbBar\\Baz")
src = r'(ab)b+'
dst = r'\1' + re.escape(r'\9z')  # raw string: '\9' is not a valid escape in a normal literal

print(my_replace(src, dst, string))

Output 4

ab\9zBar\\Baz
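A variation on Test 4, as a sketch only. It assumes, as the other answer states, that the backslash is the only character with special meaning in a replacement string, so the literal part is escaped by doubling backslashes instead of calling re.escape. The helper name literal_repl and the sample strings are hypothetical.

```python
import re


def literal_repl(text):
    # Hypothetical helper: double each backslash so re.sub inserts
    # `text` literally; other characters such as '.' are left alone.
    return text.replace("\\", "\\\\")


string = "abbbbbbBar"
src = r"(ab)b+"
# The group reference stays unescaped; only the literal tail is escaped.
dst = r"\1" + literal_repl(r"\9z.")

result = re.sub(src, dst, string)
print(result)  # ab\9z.Bar
```

If the literal part could start with a digit, writing the group reference as \g<1> avoids it being read as part of the group number.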

Escape is needed because I don't have control over the arguments. They may contain special characters which need to be interpreted as literals. — ideasman42, Oct 09 '19 at 03:47
Test 2, is interpreting the destination, try: `dst = r'\9z'` — ideasman42, Oct 09 '19 at 03:57

score 0 · Answer 3 · answered Oct 09 '19 at 04:03

You could resort to split:

import re

haystack = r"some text with stu\ff to replace"
needle = r"stu\ff"
replacement = r"foo.bar"

result = replacement.join(re.split(re.escape(needle), haystack))
print(result)

This should also work with needle at the beginning or end of haystack.
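A quick sketch to check that claim. The wrapper name split_replace and its flags parameter are hypothetical additions; the body is the same join/split expression as above.

```python
import re


def split_replace(haystack, needle, replacement, flags=0):
    # re.split matches the escaped needle literally; str.join inserts the
    # replacement verbatim, so it is never parsed for group references.
    return replacement.join(re.split(re.escape(needle), haystack, flags=flags))


# Needle at both the beginning and the end of the haystack.
print(split_replace(r"stu\ff in the middle stu\ff", r"stu\ff", r"foo\1.bar"))
# foo\1.bar in the middle foo\1.bar

# Case-insensitive matching via the assumed flags parameter.
print(split_replace("My FOO", "foo", "Bar\\Baz", flags=re.IGNORECASE))
# My Bar\Baz
```

When the needle sits at an edge, re.split returns an empty string on that side, and join then places the replacement there.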