-1

I need to replace all the HTML tags (e.g. <p>, <img>, etc.) in a web page source code, but I want to keep <br> and <br/>. I have tried:

re.sub(r'<[^>]+?>', u'', html, flags=re.I)

This achieves only the first goal: it also removes <br> and <br/>. r'<[^>br]+?>' won't achieve the goal either.
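
To illustrate why the second attempt falls short, here is a minimal sketch. The sample string `html` is made up for the demonstration. A character class such as `[^>br]` excludes the individual characters `>`, `b` and `r`, not the sequence "br", so any tag containing a `b` or an `r` survives.

```python
import re

# Made-up sample input for illustration only.
html = '<p>a<br>b<table>c</p>'

# [^>br] is a character class: it rejects the single characters '>', 'b'
# and 'r' anywhere inside the tag, not the two-letter sequence "br".
result = re.sub(r'<[^>br]+?>', '', html, flags=re.I)

print(result)  # a<br>b<table>c  (<table> is kept because it contains 'b')
```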

What is the correct regular expression?

jonrsharpe
James King
  • 1
    Don't use regular expressions for manipulating HTML - HTML **is not** a regular language. Use an HTML parser. ([Amusing version.](http://stackoverflow.com/a/1732454/3001761)) – jonrsharpe Nov 04 '14 at 10:39
  • @jonrsharpe I know you are referring to BeautifulSoup etc. But I do not want to install another plugin for this simple problem. – James King Nov 04 '14 at 10:40
  • 2
    There is a parser in the standard library, too: [`HTMLParser`](https://docs.python.org/2/library/htmlparser.html). – jonrsharpe Nov 04 '14 at 10:41
  • 2
    Will the day come when people stop trying to unscrew a bolt with a hammer (aka parse HTML with regex)? – Mauro Baraldi Nov 04 '14 at 10:53
  • @AvinashRaj Can you recheck your answer? All the tags, including `<br>`, are still removed. Is this negative lookahead correct? – James King Nov 04 '14 at 11:09
  • @MauroBaraldi, probably around the same time they stop using [double-clawed hammers](http://blog.codinghorror.com/the-php-singularity/) – Paul Draper Nov 04 '14 at 11:11
  • @AvinashRaj you included an extra `<` inside the parentheses before `br`. It should be something like this: `<((?!br).)*>` – James King Nov 04 '14 at 11:15
  • 1
    I have the correct answer to the other question you asked, which you deleted for some reason: `re.sub(r"((<br>)+)", "<br>", html, flags=re.I|re.UNICODE)`. The problem was that you had left out the `flags` keyword, so `re.I|re.UNICODE` was taken as the `count` argument, limiting the call to the first 34 replacements. That made it look like nothing was happening, because you were only looking at the last line of the input text. I answered here because there's no way to message you the answer.
    – will Nov 04 '14 at 12:57
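
For reference, the standard-library parser mentioned in the comments can be sketched roughly as follows. This assumes Python 3, where the module is `html.parser` (the linked docs are for the Python 2 `HTMLParser` module). The class name `KeepBrStripper` and the helper `strip_tags_keep_br` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class KeepBrStripper(HTMLParser):
    """Collect text content and re-emit only <br> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; every tag other than br is dropped.
        if tag == "br":
            self.parts.append("<br>")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing forms such as <br/> and <br />.
        if tag == "br":
            self.parts.append("<br/>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_keep_br(html):
    parser = KeepBrStripper()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(strip_tags_keep_br('<p>Hello<br>world<br/><img src="x.png"></p>'))
# Hello<br>world<br/>
```

Note that with `convert_charrefs=True` entities such as `&amp;` come back as plain characters, so the output may need re-escaping if it is inserted into HTML again.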

2 Answers

0

The regex below, with a negative lookahead assertion, would work.

<(?!br\/?>)[^<>]*>

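A minimal sketch of how this pattern might be applied with `re.sub`. The sample `html` string is made up, and `SPACED_PATTERN` is a hypothetical variant that is not part of the answer: the answer's pattern spares only `<br>` and `<br/>`, so the spaced form `<br />` is still removed unless `\s*` is added to the lookahead.

```python
import re

# Pattern from the answer: skip a tag only when it is exactly <br> or <br/>.
ANSWER_PATTERN = re.compile(r'<(?!br\/?>)[^<>]*>', re.I)

# Hypothetical variant (an assumption, not from the answer):
# \s* in the lookahead also spares "<br />".
SPACED_PATTERN = re.compile(r'<(?!br\s*\/?>)[^<>]*>', re.I)

# Made-up sample input for illustration only.
html = '<p>Hello<br>world<br/>again<br /><img src="x.png"></p>'

print(ANSWER_PATTERN.sub('', html))  # Hello<br>world<br/>again
print(SPACED_PATTERN.sub('', html))  # Hello<br>world<br/>again<br />
```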

Avinash Raj
-1
<((?!\bbr\b).)*?>

This should work for your case. The negative lookahead ensures that <br> is not matched.

Edit:

<(?:(?!\bbr\/?(?=>)).)*?>

Try this if you have to handle odd cases such as <a href="http://host.domain.tld/br">.

See demo.

http://regex101.com/r/sU3fA2/57
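
A small sketch comparing the two patterns on the URL case mentioned in the edit. The sample `html` string and the names `FIRST` and `EDITED` are made up for this example. The first pattern refuses to cross any standalone `br`, so the link tag survives; the edited pattern blocks only a `br` that is directly followed by `>` or `/>`.

```python
import re

FIRST = re.compile(r'<((?!\bbr\b).)*?>', re.I)
EDITED = re.compile(r'<(?:(?!\bbr\/?(?=>)).)*?>', re.I)

# Made-up sample input for illustration only.
html = '<a href="http://host.domain.tld/br">x</a><br>'

# The "br" at the end of the URL stops the first pattern from matching <a ...>.
print(FIRST.sub('', html))   # <a href="http://host.domain.tld/br">x<br>

# The edited pattern removes the link tag and still keeps <br>.
print(EDITED.sub('', html))  # x<br>
```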

vks
  • This won't work: `<((?!<br>|<\/br>)[^>])+?>`. Some other tags are kept. – James King Nov 04 '14 at 10:56
  • All the tags (including `<br>`) are still removed. Is this negative lookahead correct? – James King Nov 04 '14 at 11:06
  • You seem to have made a tiny mistake. This works: `<((?!br)[^>])+?>`. You included an extra `<` before `br` inside the parentheses. Is that right? – James King Nov 04 '14 at 11:14
  • Why do you include the `|\/br`? There are only three versions of `br`: `<br>`, `<br/>`, `<br />` – James King Nov 04 '14 at 11:19
  • You had better use `The negative lookahead` instead of `The lookahead` in your answer, so newbies won't be confused. Just my humble suggestion. Lookahead is one of the most difficult parts of regex. – James King Nov 04 '14 at 11:24
  • @vks Hi what is the `?:` in `<(?:(?!\bbr\/?(?=>)).)*?>`? What does it do? – James King Apr 16 '15 at 23:30
  • @JamesKing `?:` means a non-capturing group. So if you do a `findall` it will give the whole match, because by default `findall` returns only the captured groups when the pattern has any. Or, if you use `match`, it won't be stored in `match.group(1)`. – vks Apr 17 '15 at 04:08
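
The difference described in the last comment can be seen in a short sketch. The sample `html` string is made up for the demonstration. With a capturing group, `re.findall` returns what the group captured on its last repetition; with `(?:...)` it returns the whole match.

```python
import re

# Made-up sample input for illustration only.
html = '<p>Hello<br>world</p>'

# Capturing group: findall returns the group's last captured character.
capturing = re.findall(r'<((?!\bbr\/?(?=>)).)*?>', html)

# Non-capturing group: findall returns each full match.
non_capturing = re.findall(r'<(?:(?!\bbr\/?(?=>)).)*?>', html)

print(capturing)      # ['p', 'p']
print(non_capturing)  # ['<p>', '</p>']
```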