I would like a regex to remove HTML tags and entities such as &nbsp; and &quot; from a string. The regex I have removes the HTML tags but not the entities. I'm using .NET 4.

Thanks

CODE:

     String result = Regex.Replace(blogText, @"<[^>]*>", String.Empty);
dragfyre
Mark
  • Before you proceed, take a look here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Zruty May 19 '11 at 15:58
  • Regex and HTML are never a good mix. Have a look @ http://stackoverflow.com/questions/5496704/strip-html-and-css-in-c – Michael Paulukonis May 19 '11 at 16:00
  • this could be easily done with HtmlAgilityPack, see [Stripping all html tags with Html Agility Pack](http://stackoverflow.com/q/3140919/102112) – Oleks May 19 '11 at 16:16

2 Answers

Don't use regular expressions for this; use the HTML Agility Pack:

http://www.codeplex.com/htmlagilitypack
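
For illustration, here is a minimal sketch of what that could look like. It assumes the HtmlAgilityPack assembly is referenced in the project; the class and method names (`HtmlStripper`, `ToPlainText`) are made up for this example and are not part of the library. Note that it decodes entities into their characters rather than deleting them, which may or may not be what you want.

```csharp
using System;
using HtmlAgilityPack; // third-party library, assumed to be referenced

public static class HtmlStripper
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes, dropping the markup.
        // DeEntitize converts entities such as &quot; into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Usage would be something like `string result = HtmlStripper.ToPlainText(blogText);`. As far as I know, the text inside script and style elements is part of `InnerText` too, so you may need to remove those nodes first if your input contains them.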

Tom Gullen

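If you want to build on what you already created, you can change it to the following:

     String result = Regex.Replace(blogText, @"<[^>]*>|&\w+;?", String.Empty);

It means...

  1. Either match tags as you defined...
  2. ...or match an & followed by one or more word characters (\w+, as many as possible) and an optional trailing semicolon, so that both &nbsp and &nbsp; are removed.

Neither alternative handles every nasty case (for example, a > inside an attribute value, or an ampersand in ordinary text such as "R&D", which would be cut down to "R"), but it usually works.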
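
If you would rather turn the entities into the characters they stand for than delete them, one option is to keep the tag regex and decode afterwards. The following is only a sketch under that assumption: `TagStripper` and `StripTagsAndDecode` are names invented for this example, and it relies on `System.Net.WebUtility.HtmlDecode`, which is available from .NET 4.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: strips tags with the regex from the question,
    // then decodes entities instead of removing them.
    public static string StripTagsAndDecode(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Same tag pattern as in the question; it still fails on a '>'
        // inside an attribute value.
        string withoutTags = Regex.Replace(html, @"<[^>]*>", string.Empty);

        // Converts &quot;, &amp;, &nbsp; and numeric entities to characters.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // &nbsp; decodes to U+00A0; replace it with an ordinary space.
        return decoded.Replace('\u00A0', ' ');
    }
}
```

Decoding after stripping means that escaped markup such as &lt;b&gt; ends up in the result as the literal text <b>, and a bare ampersand as in "R&D" is left alone.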

Staffan Nöteberg