s = re.sub(r"<style.*?</style>", "", s)

Isn't this code supposed to remove styles in the s string? Why does it not work? I am trying to remove the following code:

<style type="text/css">
body { ... }
</style>

Any suggestion?

Shaokan
  • Every time I see regex parsing HTML, I remember this question: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Utku Zihnioglu Aug 11 '11 at 23:28

1 Answer


No, it doesn't work because the re.DOTALL flag is missing: the <style> block spans several lines, and by default '.' does not match a newline.

re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

http://docs.python.org/library/re.html#re.DOTALL
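To make this concrete, here is a minimal sketch of the pattern from the question with and without the flag. The sample string and the names html, unchanged and cleaned are only for illustration, and it assumes a Python version where re.sub accepts the flags keyword (2.7 or 3.1 and later).

```python
import re

# Sample input (illustrative): a <style> block that spans several lines.
html = '''alhambra
<style type="text/css">
body { ... }
</style>
toromizu'''

# Without DOTALL, '.' stops at the first newline, so the pattern
# never reaches </style> and nothing is replaced.
unchanged = re.sub(r"<style.*?</style>", "", html)

# Pass flags by keyword: the fourth positional argument of re.sub is count.
cleaned = re.sub(r"<style.*?</style>", "", html, flags=re.DOTALL)

print(unchanged == html)
print(cleaned)
```

Note that writing re.sub(pattern, repl, s, re.DOTALL) would silently pass the flag as count, which is an easy mistake to make.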

Edit

In some cases, you may need the dot to match every character (newlines included) in one part of a pattern, and only non-newline characters in another part. The re.DOTALL flag applies to the whole pattern, so it doesn't allow this.

In that case, a useful trick is to write [\s\S], which matches any character, including a newline:

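import re

s = '''alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY'''
print(s, '\n')

# [\s\S] matches any character, newlines included;
# the '.+' after "mizu" still stops at the end of the line.
regx = re.compile(r"<style[\s\S]*?</style>|(?<=ro)mizu.+")

s = regx.sub('AAA', s)
print(s)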

result

alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY 

alhambra
AAA
toroAAA
YYYYYYYYYYYYYY
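
On Python 3.6 and later there should be another way to get the same result: a scoped inline flag, (?s:...), which turns DOTALL on only inside that group. A minimal sketch, reusing the sample string from the example; the name result is just for illustration:

```python
import re

s = '''alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY'''

# (?s:...) enables DOTALL only inside the group (Python 3.6+);
# the '.+' after "mizu" is outside it and still stops at the end of the line.
regx = re.compile(r"(?s:<style.*?</style>)|(?<=ro)mizu.+")

result = regx.sub('AAA', s)
print(result)
```

The [\s\S] trick has the advantage of also working on older Python versions.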
eyquem
  • Yes correct, I just came back to say that I've found the solution but here you are! Good answer! – Shaokan Aug 11 '11 at 23:05