java regex replace all html tags except br

Question

I need a regular expression that I can use with replaceAll to replace every HTML tag with an empty string, except any variation of br, so that line breaks are preserved.

Use an HTML DOM parser instead. Regex cannot cover every possibility that an HTML tag can present. And @HovercraftFullOfEels, have you considered Tony the Pony? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Jonathan M, Nov 18 '11 at 17:47
I found the following to replace all html tags <\\s*br\\s*\\[^>] — user373201, Nov 18 '11 at 17:57
@JonathanM: I don't think that's true. A single HTML tag doesn't have any recursive nesting, or anything like that; I don't see why a regex couldn't match it. — ruakh, Nov 18 '11 at 17:59
@ruakh, the closest anyone here has come to doing it is the illustrious Tom Christ, who employed multiple convoluted regexes in Perl. But if you read the full post, he says it is not the best way to attack the problem. Don't be fooled by his headline. He's quite clear that it's a bad idea. http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491 — Jonathan M, Nov 18 '11 at 18:47
@JonathanM: Re: its being a bad idea: Well, obviously. No one is saying otherwise! ;-) But you said that it *cannot* be done. — ruakh, Nov 18 '11 at 19:45
@ruakh, I'm not sure it can be done with a single regex. I dunno. Every time someone comes up with a regex to do it, there's an exception. I'll go ahead and say it can't be done just to be a catalyst to someone who wants to spend the time proving otherwise. :) — Jonathan M, Nov 19 '11 at 01:44

score 4 · Accepted Answer · edited May 23 '17 at 11:51

You might get some answers that claim to work.

Those answers might even work for the particular cases you try them against.

But know that regular expressions (which I'm fond of in general) are the wrong tool for the job in this case.

And as your project evolves and needs to cover more complex HTML inputs, the regular expression will get more and more convoluted, and there may well come a time when it simply cannot solve your problem anymore, period.

Do it the right way from the beginning. Use an HTML parser, not a regex.

For reference, here are some related SO posts:

Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:

Wow. Bobince has me convinced. I feel the need to pray actually. — EdgeCase, Nov 21 '11 at 15:18

ruakh · Answer 2 · 2011-11-18T19:55:18.677

If the HTML is known to be valid, then you can use this regex (case-insensitive):

<(?!br\b)/?[a-z]([^"'>]|"[^"]*"|'[^']*')*>

but it can fail in interesting ways if you give it invalid HTML. Also, I took "HTML tags" pretty literally; the above won't cover <!-- comments --> and <!DOCTYPE declarations>, and won't convert <![CDATA[ blocks ]]> and &entity;s to plain text.
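
As a rough illustration of how the regex above might be wired into replaceAll, here is a minimal Java sketch. The class and method names (BrPreservingStripper, stripTagsExceptBr) and the sample input are assumptions made for the example, not part of the answer. The pattern is the one given above, compiled case-insensitively, and the same caveats about invalid HTML, comments, and entities still apply.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the original answer.
final class BrPreservingStripper {

    // The regex from the answer, escaped for a Java string literal.
    // (?!br\b) skips tags that open with the name "br" (<br>, <br/>, <br ...>);
    // the quoted alternatives let attribute values contain '>' without
    // ending the tag early.
    private static final Pattern TAG_EXCEPT_BR = Pattern.compile(
            "<(?!br\\b)/?[a-z]([^\"'>]|\"[^\"]*\"|'[^']*')*>",
            Pattern.CASE_INSENSITIVE);

    // Removes every matched tag and leaves the remaining text untouched.
    static String stripTagsExceptBr(String html) {
        return TAG_EXCEPT_BR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b><br/>second <a href=\"x>y\">line</a><BR>end</p>";
        System.out.println(stripTagsExceptBr(html));
        // prints: Hello world<br/>second line<BR>end
    }
}
```

Note that the lookahead only protects tags that start with <br, so a stray closing </br> would still be removed, and a tag such as <brown> is stripped because the word boundary after "br" is missing.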

It's probably better to take a step back, think about why you want to strip out these HTML tags — that is, what you're actually trying to achieve — and then find an HTML-handling library that offers a better way to achieve that goal. HTML cleaning is really a solved problem; you shouldn't need to reinvent it.

UPDATE: I've just realized that, even for valid HTML, the above has some major limitations. For example, it will mishandle something like <!-- <foo> --> (converting it to just <!--  -->), and also something like <script><foo></script>. The latter fails because HTML proper has a small number of tags with CDATA content: everything after the start-tag until the first </ is taken to be character data, not HTML tags. (Fortunately, XHTML was forced to get rid of this concept due to XML's lack of support for it.) Both of these limitations can be addressed, of course (using more regexes!), but they should help reinforce the point that you should use a well-tested HTML-handling library rather than trying to roll your own regexes. If you have a lot of guarantees about the nature of the HTML you're trying to handle, then regexes can be useful; but if what you're trying to do is strip out arbitrary tags, then that's a good sign that you don't have these sorts of guarantees.
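
To make the two limitations in the update concrete, here is a small Java sketch that runs the same regex over a comment containing a tag and over a script element containing tag-like text. The class name (StripperLimitations), the strip helper, and the two sample strings are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
final class StripperLimitations {

    // Same pattern as in the answer, compiled case-insensitively.
    private static final Pattern TAG_EXCEPT_BR = Pattern.compile(
            "<(?!br\\b)/?[a-z]([^\"'>]|\"[^\"]*\"|'[^']*')*>",
            Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return TAG_EXCEPT_BR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The comment delimiters do not match (they start with "<!"),
        // but the tag inside the comment does, leaving an empty shell.
        System.out.println("[" + strip("<!-- <foo> -->") + "]");
        // prints: [<!--  -->]

        // In HTML, script content is character data, yet the regex
        // still treats "<foo>" as a tag and deletes it with the rest.
        System.out.println("[" + strip("<script><foo></script>") + "]");
        // prints: []
    }
}
```

An HTML parser knows about comments and script content, which is why neither case needs special handling there.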

+1 for not re-inventing the wheel... unless this is an assignment. — Alb, Nov 18 '11 at 18:00
