I need to remove all the HTML tags (e.g. `<p>`, `<img>`, etc.) from a web page's source code, but I want to keep `<br>` and `<br/>`. I have tried:
re.sub(r'<[^>]+?>', u'', html, flags=re.I)
This achieves only the first goal: it removes the tags, but it does not keep `<br>` or `<br/>`. `r'<[^>br]+?>'` won't achieve the goal either.
What is the correct regular expression?
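One possible approach is a negative lookahead placed right after the opening `<`, so that the pattern refuses to match when the tag is a bare `br`. The following is a minimal sketch, not a definitive answer: the names `TAG_EXCEPT_BR` and `strip_tags_keep_br` and the sample string are made up for illustration, and it assumes the only tags to keep are the plain forms `<br>`, `<br/>` and `<br />` (a `<br>` carrying attributes, or a `</br>`, would still be removed).

```python
import re

# "<", then a negative lookahead that rejects br, br/ and br /,
# then the rest of the tag up to the closing ">".
TAG_EXCEPT_BR = re.compile(r'<(?!br\s*/?>)[^>]+>', flags=re.I)

def strip_tags_keep_br(html):
    """Remove every tag except plain <br> variants (hypothetical helper)."""
    return TAG_EXCEPT_BR.sub('', html)

sample = '<p>Hello<br>world<BR/><img src="x.png"></p><b>bold</b><br />end'
print(strip_tags_keep_br(sample))  # Hello<br>world<BR/>bold<br />end
```

Note that `[^>br]` in the attempt above is a character class, so it excludes the letters `b` and `r` anywhere inside the tag rather than the tag name `br`. A regular expression like this is also only an approximation of HTML: a `>` inside an attribute value or a comment would confuse it, in which case an HTML parser is the safer tool.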
[...] are removed. Is this negative lookahead correct? – James King Nov 04 '14 at 11:09
`[...])+)", "[...]", html, flags=re.I|re.UNICODE)` The problem you had was that you had left out the `flags` keyword, so `re.I|re.UNICODE` was being taken as the `count` argument, limiting the substitution to the first 34 replacements. That made it look like nothing was happening, because you were only looking at the last line of the input text. I answered here because there's no way to message you the answer. – will Nov 04 '14 at 12:57
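The pitfall described in this comment can be reproduced in a few lines. This is only an illustrative sketch with a made-up input string: the fourth positional parameter of `re.sub` is `count`, so flags passed positionally are silently interpreted as a replacement limit (`re.I` is 2 and `re.UNICODE` is 32, hence 34).

```python
import re

text = 'a' * 40  # hypothetical input: 40 occurrences to replace

# Flags passed positionally land in the count parameter,
# so only the first 34 matches are replaced and no flag is applied.
wrong = re.sub(r'a', '-', text, re.I | re.UNICODE)

# Passing them by keyword applies the flags and leaves count unlimited.
right = re.sub(r'a', '-', text, flags=re.I | re.UNICODE)

print(int(re.I | re.UNICODE))  # 34
print(wrong.count('-'))        # 34
print(right.count('-'))        # 40
```

Always writing `flags=` by keyword avoids the problem; newer Python releases may also emit a deprecation warning when `count` is passed positionally.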