In C#, I have the following regular expression to remove HTML from a string:

var regex = new Regex("<[^>]*(>|$)");
return regex.Replace(input, match => "");

There are some cases where we need to allow doubled angle brackets (>> and <<). How do I change the expression above so that it simply skips them?

Paxton
  • by "remove HTML" you mean to strip the tags and get just text? – Aziz Mar 16 '16 at 20:20
  • @Aziz, I updated the code above. The regex removes the HTML, but it chokes on >> and <<. I would like the expression to just ignore these double angle brackets. In fact, they come from C++ code snippets. – Paxton Mar 16 '16 at 20:27
  • Sigh... regex is the wrong tool for dealing with XML or HTML. What's going to happen when a tag is broken across multiple lines? – Jim Garrison Mar 16 '16 at 20:28
  • Mandatory link: http://stackoverflow.com/a/1732454/4037348 – Gediminas Masaitis Mar 16 '16 at 20:30
  • Yeah, I am aware of the link (don't use Regex for HTML parsing), but our scenario is simple and controlled. – Paxton Mar 16 '16 at 22:02
  • @JimGarrison since when does regex care about lines unless specifically told to care about lines? – Nyerguds Mar 24 '16 at 10:09

1 Answer

Not sure why the $ at the end is in there too, but anyway... negative lookahead and lookbehind can solve this problem:

Regex regex = new Regex("(?<![<])<[^<>]+>(?![>])");
return regex.Replace(input, String.Empty);

This matches a < that is not preceded by another <, then one or more characters that are neither < nor >, and finally a > that is not followed by another >.
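
To see how the lookarounds behave on the C++-style input mentioned in the question, here is a minimal sketch. The class name HtmlStripper, the method name StripTags, and the sample strings are made up for illustration; only the pattern itself comes from the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Same pattern as above: a lone '<', one or more characters that are
    // neither '<' nor '>', then a lone '>'.
    private static readonly Regex TagRegex =
        new Regex("(?<![<])<[^<>]+>(?![>])");

    // Hypothetical helper: removes every match of the pattern from the input.
    public static string StripTags(string input)
    {
        return TagRegex.Replace(input, String.Empty);
    }

    public static void Main()
    {
        // The doubled brackets survive; the surrounding tags are removed.
        Console.WriteLine(StripTags("<p>std::cout << x;</p>")); // std::cout << x;
        Console.WriteLine(StripTags("<b>a >> 2</b>"));          // a >> 2
    }
}
```

One caveat with this sketch: a single pair of angle brackets in a snippet, such as vector<int> followed by a space, still looks like a tag to the pattern and would be removed, so this only protects the doubled forms.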

Nyerguds