Replace a multiline pattern with re.sub

Question

In the following string, I'd like to replace every BeginHello...EndHello block that contains haha with '':

s = """BeginHello
sqdhaha
fsqd
EndHello

BeginHello
1231323
EndHello

BeginHello
qsd
qsd
haha
qsd
EndHello
BeginHello
azeazezae
azeaze
EndHello
"""

This code:

import re
s = re.sub(r'BeginHello.*haha.*EndHello', '', s)
print(s)

does not work here: nothing is deleted.

How can I use such a regex for a multiline pattern with Python's re.sub?

Use `r'(?s)BeginHello(?:(?!BeginHello|EndHello|haha).)*haha.*?EndHello'` or `(?s)BeginHello(?:(?!BeginHello|haha).)*?haha.*?EndHello` — Wiktor Stribiżew, Nov 08 '18 at 13:16
@WillemVanOnsem What do you mean, can you post an answer explaining it? — Basj, Nov 08 '18 at 13:25
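
As a quick check of the two patterns from the first comment, here is a minimal sketch that runs both against the sample string from the question. The names `patterns` and `results` are only illustrative, and the check covers this input only, not every possible block layout.

```python
import re

s = """BeginHello
sqdhaha
fsqd
EndHello

BeginHello
1231323
EndHello

BeginHello
qsd
qsd
haha
qsd
EndHello
BeginHello
azeazezae
azeaze
EndHello
"""

# The two patterns quoted in the comment; (?s) is the inline form of re.DOTALL.
patterns = [
    r'(?s)BeginHello(?:(?!BeginHello|EndHello|haha).)*haha.*?EndHello',
    r'(?s)BeginHello(?:(?!BeginHello|haha).)*?haha.*?EndHello',
]

# Apply each pattern to the same input so the results can be compared.
results = [re.sub(p, '', s) for p in patterns]
for r in results:
    print(r)
```

On this input both patterns should remove the same two blocks, the ones containing haha.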

score 1 · Accepted Answer · answered Nov 08 '18 at 13:17

We can try matching using the following pattern:

BeginHello((?!\bEndHello\b).)*?haha.*?EndHello

This matches an initial BeginHello. Then, it uses a tempered dot:

((?!\bEndHello\b).)*?

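to consume one character at a time, but only while EndHello does not start at the current position. Because the quantifier is lazy, it also stops at the first haha it reaches. In effect, the tempered dot never runs past an EndHello and stops as soon as haha appears, so haha must lie inside the same block. If the match has succeeded so far, we then consume haha, followed by everything up to the nearest EndHello. The re.DOTALL flag is needed so that . also matches newlines.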

import re
s = re.sub(r'BeginHello((?!\bEndHello\b).)*?haha.*?EndHello', '', s,
    flags=re.DOTALL)
print(s)



BeginHello
1231323
EndHello


BeginHello
azeazezae
azeaze
EndHello
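
For readers who want to see the pieces separately, the same pattern can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch: the names `pattern` and `result` are illustrative, and the group is made non-capturing here, which does not change what is matched.

```python
import re

s = """BeginHello
sqdhaha
fsqd
EndHello

BeginHello
1231323
EndHello

BeginHello
qsd
qsd
haha
qsd
EndHello
BeginHello
azeazezae
azeaze
EndHello
"""

# Same regex as in the answer, split across lines with re.VERBOSE.
pattern = re.compile(r"""
    BeginHello               # opening marker
    (?:(?!\bEndHello\b).)*?  # tempered dot: one char at a time, never starting on EndHello
    haha                     # the text the block must contain
    .*?                      # shortest possible run up to ...
    EndHello                 # ... the closing marker of the same block
""", re.DOTALL | re.VERBOSE)

result = pattern.sub('', s)
print(result)
```

The blocks without haha are left untouched, and only the blank lines around the removed blocks remain.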

answered Nov 08 '18 at 13:17

Tim Biegeleisen


Thank you for your solution! Isn't there a solution that avoids repeating `EndHello` in the regex? – Basj Nov 08 '18 at 13:33
Why does that bother you? Shouldn't performance be the major concern? If the answer works and runs reasonably well, then there is no reason to not use it IMHO. – Tim Biegeleisen Nov 08 '18 at 13:34
Yes, it was just out of curiosity. – Basj Nov 08 '18 at 13:38
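
On the curiosity question about writing EndHello only once: one possible alternative, which is a different technique from the tempered dot in the answer, is to match every block with a plain lazy pattern and decide in a replacement function whether to delete it. A minimal sketch, assuming blocks are never nested; `drop_if_haha` is a hypothetical helper name.

```python
import re

s = """BeginHello
sqdhaha
fsqd
EndHello

BeginHello
1231323
EndHello

BeginHello
qsd
qsd
haha
qsd
EndHello
BeginHello
azeazezae
azeaze
EndHello
"""

def drop_if_haha(match):
    # Hypothetical helper: delete the block if it contains 'haha', otherwise keep it unchanged.
    block = match.group(0)
    return '' if 'haha' in block else block

# With DOTALL, the lazy .*? stops at the first EndHello after each BeginHello,
# so every match is exactly one block (assuming blocks are not nested).
result = re.sub(r'BeginHello.*?EndHello', drop_if_haha, s, flags=re.DOTALL)
print(result)
```

The regex itself stays simple here, and the haha test moves into ordinary Python code.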

score 0 · Theo Emms · Answer 2 · 2018-11-08T13:28:13.593

You want re.DOTALL, which lets . match any character, including \n.

import re
s = re.sub(r'BeginHello.*?haha.*?EndHello', '', s, flags=re.DOTALL)
print(s)

edited Nov 08 '18 at 13:28

answered Nov 08 '18 at 13:20

Theo Emms


Why would anyone want to use `re.M` with your pattern? That does not allow `.` to match `\n`. – Wiktor Stribiżew Nov 08 '18 at 13:21
I tried with [`re.MULTILINE`](https://docs.python.org/2/library/re.html#re.M) but it does not work. – Basj Nov 08 '18 at 13:24
Yeah, sorry, I got mixed up with `re.DOTALL` – Theo Emms Nov 08 '18 at 13:26
@TheoEmms With your current solution, the output is empty (probably because of "greedy" regex). – Basj Nov 08 '18 at 13:27
You're right... I've obviously not used regex enough recently... Answer is updated – Theo Emms Nov 08 '18 at 13:28
The answer is wrong. You can't use `.*` or `.*?` here with any options. You need a tempered greedy token, unrolled or not. – Wiktor Stribiżew Nov 08 '18 at 13:30
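
To make the objections in these comments concrete, here is a small sketch that runs the lazy pattern from this answer on the question's sample string, once with re.MULTILINE and once with re.DOTALL. The variable names are illustrative only.

```python
import re

s = """BeginHello
sqdhaha
fsqd
EndHello

BeginHello
1231323
EndHello

BeginHello
qsd
qsd
haha
qsd
EndHello
BeginHello
azeazezae
azeaze
EndHello
"""

lazy = r'BeginHello.*?haha.*?EndHello'

# re.MULTILINE only changes how ^ and $ behave; '.' still stops at '\n',
# so the pattern cannot span lines and the string comes back unchanged.
with_multiline = re.sub(lazy, '', s, flags=re.MULTILINE)
print(with_multiline == s)

# re.DOTALL lets '.' cross newlines, but the first lazy .*? may then run
# past an EndHello into the next block while looking for haha.
with_dotall = re.sub(lazy, '', s, flags=re.DOTALL)
print('1231323' in with_dotall)
```

With re.DOTALL, the match that starts at the second BeginHello extends through the third block, so the 1231323 block, which has no haha, is deleted as well. That is the case the tempered dot is meant to prevent.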
