python regular expression matching anything

Question

My regular expression isn't doing anything to my string.

python

import re

data = 'random\n<article stuff\n</article>random stuff'
datareg = re.sub(r'.*<article(.*)</article>.*', r'<article\1</article>', data, flags=re.MULTILINE)
print(datareg)

I get

random
<article stuff
</article>random stuff

I want

<article stuff
</article>

Aw, c'mon: Not [Cthulhu Parsing](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) again. — pillmuncher, Sep 12 '12 at 23:56
@pillmuncher: it looks like malformed HTML to me, so I doubt a HTML parser would be able to work with it very easily. — Blender, Sep 13 '12 at 02:15
@Blender: I think you're right. But the substitution seems to be no valid XML either. I wonder, what does one need broken XML for? — pillmuncher, Sep 13 '12 at 10:11

score 12 · Accepted Answer · answered Sep 12 '12 at 22:25

re.MULTILINE doesn't actually make your regex multiline in the way you want it to be.

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
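As a small illustration of the quoted behaviour (the sample string and variable names here are made up for the example), re.MULTILINE only changes where '^' and '$' can match; '.' still stops at each newline:

```python
import re

# Hypothetical sample text with three lines.
text = 'first\nsecond\nthird'

# Without re.MULTILINE, '^' matches only at the very start of the string.
default_matches = re.findall(r'^\w+', text)

# With re.MULTILINE, '^' also matches immediately after each newline.
multiline_matches = re.findall(r'^\w+', text, flags=re.MULTILINE)

# '.' is unaffected by re.MULTILINE: each match ends at a newline.
dot_matches = re.findall(r'.+', text, flags=re.MULTILINE)

print(default_matches)    # ['first']
print(multiline_matches)  # ['first', 'second', 'third']
print(dot_matches)        # ['first', 'second', 'third']
```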

re.DOTALL does:

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

Change flags=re.MULTILINE to flags=re.DOTALL and your regex will work.
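A minimal sketch of the change, using the data and pattern from the question (the variable names pattern, replacement, with_multiline and with_dotall are introduced here only for the comparison):

```python
import re

data = 'random\n<article stuff\n</article>random stuff'
pattern = r'.*<article(.*)</article>.*'
replacement = r'<article\1</article>'

# With re.MULTILINE, '.' cannot cross the newline between '<article stuff'
# and '</article>', so the pattern never matches and the string is unchanged.
with_multiline = re.sub(pattern, replacement, data, flags=re.MULTILINE)

# With re.DOTALL, '.' matches newlines too, so the whole string is matched
# once and replaced by the article part alone.
with_dotall = re.sub(pattern, replacement, data, flags=re.DOTALL)

print(with_multiline == data)  # True
print(with_dotall)
# <article stuff
# </article>
```

Note that the leading and trailing '.*' are greedy, so if the input contained several article blocks this pattern would keep only the last one.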
