Python regex unable to remove content between [%~ abcd ~%]

Question

I have raw HTML and am trying to remove a whole block like [%~ as..abcd ~%] from the output string, using Python's re library.

import re

teststring = """Check the direction . [%~ MACRO wdwDate(date) BLOCK;
                 SET tmpdate = date.clone();
                 END ~%] Determine if both directions."""
cleanM = re.compile('\[\%\~ .*? \~\%\]')
scleantext = re.sub(cleanM, '', teststring)

What is wrong with the code?

The dot `.` doesn't match the newline character by default. You have to use the `re.DOTALL` flag. Also if you compile your pattern, the last line is : `scleantext = cleanM.sub('', teststring)` — Casimir et Hippolyte, Dec 01 '17 at 11:20
`%`, `]`, `~` are not special characters and don't need to be escaped. — Casimir et Hippolyte, Dec 01 '17 at 11:26

mkHun · Accepted Answer · 2017-12-01T11:34:54.813

1

Your pattern should be:

cleanM = re.compile(r'\[\%\~ .*? \~\%\]', re.S)

By default, `.` matches any character except a newline; the `re.S` flag (an alias for `re.DOTALL`) makes it match newlines as well.
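To see the difference the flag makes, here is a minimal sketch that runs the same pattern with and without `re.S` on the string from the question. The names `no_flag`, `with_flag`, `unchanged` and `cleaned` are only illustrative and not part of the original code.

```python
import re

teststring = """Check the direction . [%~ MACRO wdwDate(date) BLOCK;
                 SET tmpdate = date.clone();
                 END ~%] Determine if both directions."""

# Without re.S, "." stops at the first newline, so the multi-line block never matches.
no_flag = re.compile(r'\[%~ .*? ~%\]')
# With re.S (alias re.DOTALL), "." also matches newlines.
with_flag = re.compile(r'\[%~ .*? ~%\]', re.S)

unchanged = no_flag.sub('', teststring)
cleaned = with_flag.sub('', teststring)

print(unchanged == teststring)  # True: nothing was removed
print(cleaned)                  # Check the direction .  Determine if both directions.
```

Note that the spaces on both sides of the block are left in place, so the result contains two consecutive spaces where the block used to be.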

edited Dec 01 '17 at 11:34

answered Dec 01 '17 at 11:24


The caveat is that you need to use `re.compile` when you want to use re.S. It does not work directly in re.sub for whatever reason... – mrCarnivore Dec 01 '17 at 11:29
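For what it's worth, `re.sub` does take flags without `re.compile`, as long as they are passed by keyword. Its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag given as the fourth positional argument is interpreted as `count`, which may be what the comment above ran into. A small sketch (the variable names are illustrative):

```python
import re

teststring = """Check the direction . [%~ MACRO wdwDate(date) BLOCK;
                 SET tmpdate = date.clone();
                 END ~%] Determine if both directions."""

pattern = r'\[%~ .*? ~%\]'

# Pass the flag by keyword; a positional 4th argument would be taken as count.
by_keyword = re.sub(pattern, '', teststring, flags=re.S)

# Alternatively, the inline flag (?s) at the start of the pattern turns on DOTALL.
by_inline = re.sub(r'(?s)\[%~ .*? ~%\]', '', teststring)

print(by_keyword)
print(by_keyword == by_inline)  # True
```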
You can also exclude the markers from the match: `r'(?<=\[%~ ).*(?= \~%])'`. BTW: Always use raw strings (`r'...'`) on regular expressions. – Klaus D. Dec 01 '17 at 11:31
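A sketch of the lookaround idea, assuming the goal is to empty the block but keep the `[%~` and `~%]` markers. Compared with the pattern in the comment, this version uses a lazy `.*?` and passes `re.S` so that it also works across newlines and with several blocks; the sample string is made up for illustration.

```python
import re

sample = "a [%~ x\ny ~%] b"

# The lookbehind and lookahead assert the markers without consuming them,
# so only the text between them is replaced.
emptied = re.sub(r'(?<=\[%~ ).*?(?= ~%\])', '', sample, flags=re.S)

print(emptied)  # a [%~  ~%] b
```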

score 0 · Answer 2 · answered Dec 01 '17 at 11:27

You can use `[\S\s]*?` instead of `.*?` (it matches any character, including newlines, without needing a flag), and you can leave out `compile`:

import re
teststring = '''Check the direction . [%~ MACRO wdwDate(date) BLOCK;
                 SET tmpdate = date.clone();
                 END ~%] Determine if both directions.'''
scleantext = re.sub(r'(\[%~ [\S\s]*? ~%\])', '', teststring)

print(scleantext)
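One thing to watch when the input can contain more than one block: a greedy quantifier runs from the first opening marker to the last closing marker, while a lazy one stops at the nearest closing marker. A minimal sketch with a made-up two-block string:

```python
import re

text = "a [%~ one ~%] b [%~ two ~%] c"

# Greedy: swallows everything from the first "[%~" to the last "~%]".
greedy = re.sub(r'\[%~ [\S\s]* ~%\]', '', text)
# Lazy: removes each block separately and keeps the text in between.
lazy = re.sub(r'\[%~ [\S\s]*? ~%\]', '', text)

print(repr(greedy))  # 'a  c'
print(repr(lazy))    # 'a  b  c'
```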
