Regex expression to remove HTML but with exceptions

Question

In C#, I have the following regex expression to remove HTML from a string:

var regex = new Regex("<[^>]*(>|$)");
return regex.Replace(input, match => "");

In some cases we need to leave doubled angle brackets (>> and <<) intact. How do I change the above expression so that it simply skips these double angle brackets?

by "remove HTML" you mean to strip the tags and get just text? — Aziz, Mar 16 '16 at 20:20
@Aziz, I updated the code above. The regex removes the HTML, but it chokes on >> and <<. I would like the expression to just ignore these double angle brackets. In fact, they come from C++ code snippets. — Paxton, Mar 16 '16 at 20:27
Sigh... regex is the wrong tool for dealing with XML or HTML. What's going to happen when a tag is broken across multiple lines? — Jim Garrison, Mar 16 '16 at 20:28
Yeah, I am aware of the link (don't use Regex for HTML parsing), but our scenario is simple and controlled. — Paxton, Mar 16 '16 at 22:02
@JimGarrison since when does regex care about lines unless specifically told to care about lines? — Nyerguds, Mar 24 '16 at 10:09

Nyerguds · Accepted Answer · 2016-03-17T15:32:07.317

Not sure why the $ at the end is in there too, but anyway... negative lookahead and lookbehind can solve this problem:

Regex regex = new Regex("(?<![<])<[^<>]+>(?![>])");
return regex.Replace(input, String.Empty);

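This will match any < not preceded by another <, then the tag content (one or more characters that are not angle brackets), and then a > not followed by another >. Doubled brackets such as << and >> therefore never match and are left in place.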
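As a rough illustration of the lookaround approach, here is a minimal sketch that wraps the same pattern in a helper. The names HtmlStripper and StripTags are made up for this example and do not come from the question; the sample inputs are also assumptions about what the C++ snippets might look like.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is illustrative only.
public static class HtmlStripper
{
    // (?<![<])  the opening '<' must not be preceded by another '<'
    // <[^<>]+>  a tag: '<', one or more non-bracket characters, '>'
    // (?![>])   the closing '>' must not be followed by another '>'
    private static readonly Regex TagPattern =
        new Regex(@"(?<![<])<[^<>]+>(?![>])");

    // Removes every match of the tag pattern and leaves everything else,
    // including doubled brackets such as << and >>, untouched.
    public static string StripTags(string input)
    {
        return TagPattern.Replace(input, String.Empty);
    }
}
```

With this sketch, HtmlStripper.StripTags("<b>cout << x;</b>") returns "cout << x;" and HtmlStripper.StripTags("<p>cin >> y;</p>") returns "cin >> y;". Note that, unlike the original pattern, this one has no $ alternative, so an unterminated tag at the very end of the input is left in place rather than removed.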

edited Mar 17 '16 at 15:32

answered Mar 17 '16 at 14:58

