In the following string, I'd like to replace every BeginHello...EndHello block that contains haha with '' (the empty string):

s = """BeginHello
sqdhaha
fsqd
EndHello

BeginHello
1231323
EndHello

BeginHello
qsd
qsd
haha
qsd
EndHello
BeginHello
azeazezae
azeaze
EndHello
"""

This code:

import re
s = re.sub(r'BeginHello.*haha.*EndHello', '', s)
print(s)

does not work here: nothing is deleted.

How can I use such a regex to match a multiline pattern with Python's re.sub?

Basj

2 Answers

We can try matching using the following pattern:

BeginHello((?!\bEndHello\b).)*?haha.*?EndHello

This matches an initial BeginHello. Then, it uses a tempered dot:

((?!\bEndHello\b).)*?

to consume any character as long as it does not begin EndHello. The quantifier is also lazy, so it stops as soon as haha can match. In effect, the tempered dot consumes characters up to the first haha without ever crossing an EndHello. If the match has succeeded so far, the pattern then consumes haha, followed by everything up to the nearest EndHello. The re.DOTALL flag is needed so that . also matches newlines.

import re
s = re.sub(r'BeginHello((?!\bEndHello\b).)*?haha.*?EndHello', '', s,
    flags=re.DOTALL)
print(s)



BeginHello
1231323
EndHello


BeginHello
azeazezae
azeaze
EndHello
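
For readers who find the one-line pattern hard to parse, the same idea can be spelled out with re.VERBOSE. This is only a sketch: the two-block sample string is made up for illustration, and the non-capturing group (?:...) is an assumption swapped in for the capturing group, since the captured text is never used.

```python
import re

# Verbose spelling of the tempered-dot pattern. Whitespace and # comments
# inside the pattern are ignored because of re.VERBOSE.
TEMPERED = re.compile(r'''
    BeginHello
    (?: (?!\bEndHello\b) . )*?   # lazily take characters that do not begin EndHello
    haha                         # the marker that selects this block
    .*?                          # the rest of the block, lazily
    EndHello
''', flags=re.DOTALL | re.VERBOSE)

# Made-up sample: the first block has no haha, the second one does.
sample = "BeginHello\nkeep\nEndHello\nBeginHello\nhaha\nEndHello\n"
cleaned = TEMPERED.sub('', sample)
print(repr(cleaned))  # 'BeginHello\nkeep\nEndHello\n\n'
```

Only the second block is removed; the first one survives because the tempered dot cannot step over its EndHello while looking for haha.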
Tim Biegeleisen
  • Thank you for your solution! Isn't there a solution that avoids repeating `EndHello` twice in the regex? – Basj Nov 08 '18 at 13:33
  • Why does that bother you? Shouldn't performance be the major concern? If the answer works and runs reasonably well, then there is no reason to not use it IMHO. – Tim Biegeleisen Nov 08 '18 at 13:34
  • Yes, it was just out of curiosity. – Basj Nov 08 '18 at 13:38
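
On the question raised in the comments, one possible way to mention EndHello only once is to match each block on its own and decide in a replacement function whether to drop it. This is not the answer's method, just a sketch: the helper name remove_blocks_containing and the sample string are hypothetical, and it assumes blocks are never nested.

```python
import re

# Match one block at a time: the lazy .*? ends at the nearest EndHello.
BLOCK = re.compile(r'BeginHello.*?EndHello', flags=re.DOTALL)

def remove_blocks_containing(text, needle='haha'):
    """Hypothetical helper: delete every block whose text contains needle."""
    def keep_or_drop(match):
        block = match.group(0)
        # Returning the block unchanged leaves it in place.
        return '' if needle in block else block
    return BLOCK.sub(keep_or_drop, text)

# Made-up sample: the first block has no haha, the second one does.
sample = "BeginHello\nkeep\nEndHello\nBeginHello\nhaha\nEndHello\n"
print(repr(remove_blocks_containing(sample)))  # 'BeginHello\nkeep\nEndHello\n\n'
```

re.sub accepts a function as the replacement argument and calls it once per match, so the containment test is plain Python instead of a lookahead.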

You need the re.DOTALL flag, which makes . match any character, including \n.

import re
s = re.sub(r'BeginHello.*?haha.*?EndHello', '', s, flags=re.DOTALL)
print(s)
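
A quick check of what the flag changes may be useful. The sketch below uses a made-up two-block string (an assumption, not the question's data) and runs the same pattern with and without re.DOTALL.

```python
import re

# Made-up sample: the first block has no haha, the second one does.
sample = "BeginHello\nkeep\nEndHello\nBeginHello\nhaha\nEndHello\n"
pattern = r'BeginHello.*?haha.*?EndHello'

# Without DOTALL, . stops at every newline, so nothing matches.
no_flag = re.sub(pattern, '', sample)

# With DOTALL, . also matches \n, so the pattern can span several lines.
with_flag = re.sub(pattern, '', sample, flags=re.DOTALL)

print(repr(no_flag))    # the sample, unchanged
print(repr(with_flag))  # '\n'
```

Note the second result: with re.DOTALL the lazy .*? is still free to run past an EndHello while searching for haha, so a block without haha that comes right before a block with haha is removed along with it. On input like the question's, where such a block exists, a tempered dot or a per-block check is needed to avoid that.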
Theo Emms