9

My regular expression isn't doing anything to my string.

python

import re

data = 'random\n<article stuff\n</article>random stuff'
datareg = re.sub(r'.*<article(.*)</article>.*', r'<article\1</article>', data, flags=re.MULTILINE)
print(datareg)

I get

random
<article stuff
</article>random stuff

I want

<article stuff
</article>
user1442957
  • 2
    Aw, c'mon: Not [Cthulhu Parsing](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) again. – pillmuncher Sep 12 '12 at 23:56
  • 1
    @pillmuncher: it looks like malformed HTML to me, so I doubt an HTML parser would be able to work with it very easily. – Blender Sep 13 '12 at 02:15
  • 1
    @Blender: I think you're right. But the substitution seems to be no valid XML either. I wonder, what does one need broken XML for? – pillmuncher Sep 13 '12 at 10:11

1 Answer

12

re.MULTILINE doesn't actually make your regex multiline in the way you want it to be.

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

re.DOTALL does:

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

Change flags=re.MULTILINE to flags=re.DOTALL and your regex will work.
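To see the difference side by side, here is a minimal sketch that runs the question's pattern with each flag. The names `pattern`, `with_multiline`, and `with_dotall` are just illustrative, not from the original code.

```python
import re

data = 'random\n<article stuff\n</article>random stuff'
pattern = r'.*<article(.*)</article>.*'

# re.MULTILINE only changes how ^ and $ behave; '.' still stops at '\n',
# so the pattern cannot span the newline inside the article and never matches.
with_multiline = re.sub(pattern, r'<article\1</article>', data, flags=re.MULTILINE)

# re.DOTALL lets '.' match '\n' as well, so the pattern covers the whole string
# and the substitution keeps only the article part.
with_dotall = re.sub(pattern, r'<article\1</article>', data, flags=re.DOTALL)

print(repr(with_multiline))  # string comes back unchanged
print(repr(with_dotall))     # only the article remains
```

Since `re.sub` returns the input untouched when nothing matches, an unchanged result is usually the sign that the pattern never matched at all.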

Blender