Regular Expression Processing HTML

Question

I need to remove all the HTML tags (e.g. <img>) from a web page's source code, but I want to keep the <br> tags. I have tried:

re.sub(r'<[^>]+?>', u'', html, flags=re.I)

This only achieves the first goal; it does not keep the <br> tags. r'<[^>br]+?>' won't achieve the goal either.

What is the correct regular expression?

Don't use regular expressions for manipulating HTML - HTML **is not** a regular language. Use an HTML parser. ([Amusing version.](http://stackoverflow.com/a/1732454/3001761)) — jonrsharpe, Nov 04 '14 at 10:39
@jonrsharpe I know you are referring to BeautifulSoup etc., but I do not want to install another package for this simple problem. — James King, Nov 04 '14 at 10:40
There is a parser in the standard library, too: [`HTMLParser`](https://docs.python.org/2/library/htmlparser.html). — jonrsharpe, Nov 04 '14 at 10:41
Will the day ever come when people stop trying to unscrew a bolt with a hammer (i.e. parse HTML with regex)? — Mauro Baraldi, Nov 04 '14 at 10:53
@AvinashRaj Can you recheck your answer? All the tags, including `<br>`, are still removed. Is this negative lookahead correct? — James King, Nov 04 '14 at 11:09
@MauroBaraldi, probably around the same time they stop using [double-clawed hammers](http://blog.codinghorror.com/the-php-singularity/) — Paul Draper, Nov 04 '14 at 11:11
@AvinashRaj you included an extra `<` inside the parentheses before `br`. It should be something like this: `<((?!br).)*>` — James King, Nov 04 '14 at 11:15
I have the correct answer to the other question you asked, which you deleted for some reason: `re.sub(r"((<br>)+)", "<br>", html, flags=re.I|re.UNICODE)`. The problem was that you had left out the `flags` keyword, so `re.I|re.UNICODE` was taken as the `count` argument, limiting the call to the first 34 replacements. That made it look as if nothing was happening, because you were only looking at the last line of the input text. I answered here because there's no way to message you the answer. — will, Nov 04 '14 at 12:57
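For readers who want to try the standard-library parser suggested in the comments above, here is a minimal sketch. It assumes Python 3, where the module is `html.parser` (the linked Python 2 docs call it `HTMLParser`). The class name `BrKeepingStripper` and the function `strip_tags_keep_br` are hypothetical names made up for this illustration, and the sketch normalizes every kept tag to `<br>` and drops stray `</br>` end tags, which may or may not be what you want.

```python
from html.parser import HTMLParser


class BrKeepingStripper(HTMLParser):
    """Hypothetical helper: collect text and re-emit only <br> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <BR> arrives as "br".
        # Self-closing <br/> and <br /> also end up here via the
        # default handle_startendtag implementation.
        if tag == "br":
            self.parts.append("<br>")

    def handle_data(self, data):
        # Plain text between tags is kept unchanged.
        self.parts.append(data)


def strip_tags_keep_br(html):
    """Return html with every tag removed except <br> (assumed goal)."""
    parser = BrKeepingStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = '<p>Hello<br/>world <a href="http://host.domain.tld/br">link</a></p>'
print(strip_tags_keep_br(sample))
```

Because the parser works on tag names rather than raw characters, an attribute value that happens to end in `br` does not confuse it.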

score 0 · Answer 1 · answered Nov 04 '14 at 12:54

The regex below, which uses a negative lookahead assertion, should work.

<(?!br\/?>)[^<>]*>
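As a quick check of this pattern, the sketch below plugs it into `re.sub` with the `flags=re.I` argument from the question. The sample strings and the name `TAG_EXCEPT_BR` are made up for illustration. Note the limitation it exposes: the lookahead only recognizes `<br>` and `<br/>`, so a `<br />` written with a space is still removed.

```python
import re

# The pattern from the answer, compiled case-insensitively as in the question.
TAG_EXCEPT_BR = re.compile(r'<(?!br\/?>)[^<>]*>', flags=re.I)

sample = '<p>one<br>two<br/>three<BR>four <img src="x.png"></p>'
cleaned = TAG_EXCEPT_BR.sub('', sample)
print(cleaned)

# <br /> (with a space) does not satisfy the lookahead, so it is stripped.
spaced = TAG_EXCEPT_BR.sub('', 'a<br />b')
print(spaced)
```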

Avinash Raj

vks · Accepted Answer · 2014-11-04T14:31:53.693

-1

<((?!\bbr\b).)*?>

This should work for your case. The negative lookahead ensures that <br> is not matched.

Edit:

<(?:(?!\bbr\/?(?=>)).)*?>

Try this if you have to handle awkward cases such as <a href="http://host.domain.tld/br">, where "br" appears inside another tag.

See demo.

http://regex101.com/r/sU3fA2/57
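The difference between the two patterns can be seen with a small Python sketch. The names `FIRST` and `EDITED` and the sample string are made up for illustration, and `flags=re.I` is carried over from the question as an assumption.

```python
import re

# First pattern: refuses to match any tag that contains the word "br".
FIRST = re.compile(r'<((?!\bbr\b).)*?>', flags=re.I)
# Edited pattern: only refuses when "br" or "br/" is directly followed by ">".
EDITED = re.compile(r'<(?:(?!\bbr\/?(?=>)).)*?>', flags=re.I)

sample = 'x<br>y<a href="http://host.domain.tld/br">z</a>'

first_result = FIRST.sub('', sample)
edited_result = EDITED.sub('', sample)

print(first_result)   # the <a ...> tag survives because its URL ends in "br"
print(edited_result)  # only <br> survives
```

Neither pattern treats `<br />` (with a space before the slash) the same way: the first keeps it, while the edited one removes it, so test against the variants that actually occur in your input.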

edited Nov 04 '14 at 14:31

answered Nov 04 '14 at 10:40

vks

This won't work: `<((?!<br>|<\/br>)[^>])+?>`. Some other tags are kept. – James King Nov 04 '14 at 10:56
All the tags (including `<br>`) are still removed. Is this negative lookahead correct? – James King Nov 04 '14 at 11:06
You seem to have made a tiny mistake. This works: `<((?!br)[^>])+?>`. You included an extra `<` before `br` inside the parentheses. Is that right? – James King Nov 04 '14 at 11:14
Why do you include the `|\/br`? There are only three versions of `br`: `<br>`, `<br/>`, `<br />` – James King Nov 04 '14 at 11:19
You had better say `The negative lookahead` instead of `The lookahead` in your answer, so newbies won't be confused. Just my humble suggestion. Lookahead is one of the most difficult parts of regex. – James King Nov 04 '14 at 11:24
@vks Hi what is the `?:` in `<(?:(?!\bbr\/?(?=>)).)*?>`? What does it do? – James King Apr 16 '15 at 23:30
@JamesKing `?:` marks a non-capturing group. So if you do a `findall`, it will give you the whole match, because by default `findall` returns only the captured groups when the pattern has any. And if you use `match`, the group won't be stored in `match.group(1)`. – vks Apr 17 '15 at 04:08
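The effect of `?:` described in the last comment can be checked with a short sketch. The sample string is made up for illustration, and the two patterns differ only in whether the group is capturing.

```python
import re

sample = 'x<br>y<p>z</p>'

# Capturing group: findall returns what the group captured (its last
# iteration), not the whole match.
capturing = re.findall(r'<((?!\bbr\b).)*?>', sample)

# Non-capturing group: the pattern has no groups, so findall returns
# the whole match.
non_capturing = re.findall(r'<(?:(?!\bbr\b).)*?>', sample)

print(capturing)
print(non_capturing)

# With a non-capturing group, a match object has no group 1 at all.
m = re.search(r'<(?:(?!\bbr\b).)*?>', sample)
print(m.group(0), m.groups())
```

For `re.sub` the distinction does not change the result, since the whole match is replaced either way; it matters when you read groups back.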
