I am trying to replace a regex group's surroundings. I want to replace QQQQQ and SSSSS by LLL and MMM, with the stuff in the middle, before and after staying the same. (There may be several occurrences of QQQQQ and SSSSS).

In the code below, (1) seems to show .*? can find the right string.

In (2), using (.*?) as a group also finds the right string, but puts a stray character in the replacement instead of the group's text.

In (3) and (4), with DOTALL, nothing is matched at all.

I'm using regex here, but it's the same with re. I also tried $1 instead of \1.

Here is the code:

import regex

doc1 = """AAA QQQQQ azertyuiop SSSSS BBB"""
doc2 = """
AAA
QQQQQ
azertyuiop
SSSSS
BBB
"""
# (1) OK - gives AAA LLL dd MMM BBB. .*? finds the right string
doc = regex.sub("QQQQQ.*?SSSSS", "LLL dd MMM", doc1)
print(doc)

# (2) gives AAA LLL ☺ MMM BBB - where does this ☺ come from?
doc = regex.sub("QQQQQ(.*?)SSSSS", "LLL \1 MMM", doc1)
print(doc)     

# (3) leaves string unchanged. Isn't DOTALL supposed to match line breaks?
doc = regex.sub("QQQQQ.*?SSSSS", "LLL dd MMM", doc2, regex.DOTALL)
print(doc)   

# (4) leaves string unchanged
doc = regex.sub("QQQQQ(.*?)SSSSS", "LLL \1 MMM", doc2, regex.DOTALL)
print (doc)   # leaves unchanged

(4) is what I am attempting to do.

Francis
  • Use raw strings for regular expressions. Otherwise `\1` means a character with code `1`, not a back-reference. – Barmar Dec 20 '21 at 15:40
  • raw strings: this makes (2) work. But (3) and (4) still don't... – Francis Dec 20 '21 at 16:45
  • You need `flags=regex.DOTALL`. The 4th positional argument to `regex.sub()` is `count`, not `flags`. – Barmar Dec 20 '21 at 23:02
  • @Barmar `flags=regex.DOTALL` this was the answer. @downvoters: the link provided does not answer the question. It is only relevant to the first half of the question if you already know that the problem relates to raw strings, and if you know that then you've already solved the problem. It says nothing about DOTALL. – Francis Dec 21 '21 at 07:28
  • You asked multiple questions. The link explains why `\1` doesn't work in case 2. – Barmar Dec 21 '21 at 14:50
  • @Barmar there is only one question, and the link does not answer it. I broke the question down into sections to show how I reached the final expression. – Francis Jan 04 '22 at 14:04
  • What I meant was that there are different reasons for each failure, so it's like they're different questions. – Barmar Jan 04 '22 at 16:57

1 Answer

There are two problems.

  1. You're not passing re.DOTALL as the correct argument. With positional arguments, flags is the 5th argument, but you're passing it as the 4th, which is count. Use the flags= keyword to pass it properly.
  2. \1 in the replacement string is being interpreted as a character escape sequence, not a back-reference. That's why you get a funny character there. Use a raw string to prevent escape sequence processing. In general you should always use raw strings for regular expressions and replacement strings, because of the extensive use of backslash in them; see What exactly is a "raw string regex" and how can you use it?.
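
Both problems can be checked in isolation. The sketch below uses the standard `re` module rather than `regex` (the question notes the behaviour is the same), and the variable names are only illustrative. It writes the misplaced flag as `count=int(re.DOTALL)`, which is what the positional call in (3) and (4) amounts to.

```python
import re

# In a normal string literal, "\1" is an octal escape: one character with code 1.
plain = "LLL \1 MMM"
# In a raw string the backslash survives, so the regex engine sees a back-reference.
raw = r"LLL \1 MMM"

doc2 = "\nAAA\nQQQQQ\nazertyuiop\nSSSSS\nBBB\n"
pattern = r"QQQQQ(.*?)SSSSS"

# re.DOTALL in the 4th positional slot is read as count=16, with flags left
# at 0, so "." still stops at newlines and nothing matches.
as_count = re.sub(pattern, raw, doc2, count=int(re.DOTALL))

# Passing it by keyword actually enables DOTALL.
as_flags = re.sub(pattern, raw, doc2, flags=re.DOTALL)
```

Here `as_count` is identical to `doc2`, while `as_flags` has the markers replaced.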

This should work:

doc = regex.sub(r"QQQQQ(.*?)SSSSS", r"LLL \1 MMM", doc2, flags=regex.DOTALL)
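
One way to make the flag hard to misplace is to compile the pattern once, so the flag travels with it. This is a minimal sketch under some assumptions: it uses the standard `re` module instead of `regex`, the names `MARKED_SPAN` and `replace_markers` are hypothetical, and the replacement drops the extra spaces around `\1` so the text between the markers stays exactly as it was.

```python
import re

# Compiling once keeps DOTALL attached to the pattern, so it cannot end up
# in the count slot of a later sub() call.
MARKED_SPAN = re.compile(r"QQQQQ(.*?)SSSSS", flags=re.DOTALL)


def replace_markers(text):
    """Swap QQQQQ/SSSSS for LLL/MMM, keeping the text between them."""
    return MARKED_SPAN.sub(r"LLL\1MMM", text)


doc1 = "AAA QQQQQ azertyuiop SSSSS BBB"
doc2 = "\nAAA\nQQQQQ\nazertyuiop\nSSSSS\nBBB\n"

print(replace_markers(doc1))
print(replace_markers(doc2))
```

Because `.*?` is non-greedy, each QQQQQ pairs with the nearest following SSSSS, so several occurrences in one string are replaced separately.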
Barmar