7

I have a string containing HTML code. I want to remove all HTML tags, i.e. all characters between < and >.

This is my code snippet:

WebClient wClient = new WebClient();
SourceCode = wClient.DownloadString( txtSourceURL.Text );
txtSourceCode.Text = SourceCode;
//remove here all between "<" and ">"
txtSourceCodeFormatted.Text = SourceCode;

I hope somebody can help me.

Soner Gönül
taito
  • What if `<` and `>` characters occur inside comments, scripts, strings etc.? – Tim Pietzcker Dec 01 '13 at 14:47
  • No, do not use regex to parse HTML strings; a real nightmare is waiting for you. This is one of the most upvoted answers on SO: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ The best approach is to use a specialized HTML parser like [HTML Agility Pack](http://htmlagilitypack.codeplex.com/). – Steve Dec 01 '13 at 14:47
  • @Steve My favourite SO answer ever :) – Rotem Dec 01 '13 at 15:05
  • Might the .NET XML parser also work in this case, or am I wrong here? – marsze Dec 01 '13 at 15:10

2 Answers

14

Try this:

txtSourceCodeFormatted.Text = Regex.Replace(SourceCode, "<.*?>", string.Empty);

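But, as others have mentioned, handle with care: a regex does not understand HTML, so a `<` or `>` inside a script, comment, or attribute value will give wrong results.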
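For illustration, here is a minimal sketch of that one-liner wrapped in a helper. The names `TagStripper` and `StripTags` are made up for this example, and `RegexOptions.Singleline` is my own addition (not part of the line above) so that a tag broken across several lines is still matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes everything from a '<' up to the next '>'.
    // RegexOptions.Singleline lets '.' match newlines, so a tag that spans
    // several lines is removed as well. This is a text substitution, not
    // HTML parsing.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty, RegexOptions.Singleline);
    }
}
```

With this, `TagStripper.StripTags("<p>Hello <b>world</b></p>")` gives `Hello world`. The contents of `<script>` or `<style>` elements are not tags, though, so they stay in the output, and a stray `<` inside a script can swallow text up to the next `>`.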

Aage
3

According to Ravi's answer, you can use

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

and then, to collapse runs of whitespace left behind:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
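As a rough sketch, the two calls can be chained in one method. The class and method names (`HtmlText`, `ToPlainText`) are invented for this example; the regex patterns are the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper combining the two steps.
    public static string ToPlainText(string inputHTML)
    {
        // Step 1: drop tags and &nbsp; entities, then trim both ends.
        string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

        // Step 2: collapse every run of two or more whitespace characters
        // (spaces, tabs, newlines) into a single space.
        return Regex.Replace(noHTML, @"\s{2,}", " ");
    }
}
```

For example, `HtmlText.ToPlainText("<p>Hello,   <b>world</b></p>\n\n<p>Bye&nbsp;</p>")` returns `Hello, world Bye`. Other entities such as `&amp;` are left untouched by this sketch, and the same caveats about regex and HTML apply.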
Community
Vignesh Kumar A