C# Beginner: Delete ALL between two characters in a string (Regex?)

Question

I have a string containing HTML code. I want to remove all HTML tags, i.e. all characters between < and >.

This is my code snippet:

WebClient wClient = new WebClient();
SourceCode = wClient.DownloadString( txtSourceURL.Text );
txtSourceCode.Text = SourceCode;
// remove everything between "<" and ">" here
txtSourceCodeFormatted.Text = SourceCode;

Hope somebody can help me.

What if `<` and `>` characters occur inside comments, scripts, strings etc.? – Tim Pietzcker, Dec 01 '13 at 14:47
No, do not use Regex to parse HTML strings. A real nightmare is waiting for you. This is one of the most upvoted answers on SO: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ The best approach is to use a specialized HTML parser like [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) – Steve, Dec 01 '13 at 14:47
Using the .NET XML-Parser might also work in this case? Or am I wrong here? – marsze, Dec 01 '13 at 15:10

Aage · Accepted Answer · 2019-04-30T08:26:15.907

14

Try this:

txtSourceCodeFormatted.Text = Regex.Replace(SourceCode, "<.*?>", string.Empty);

But, as others have mentioned, handle with care.
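To try the replacement above outside the form, here is a minimal, self-contained sketch. The names `TagStripper` and `StripTags` are made up for illustration, and `RegexOptions.Singleline` is an addition that is not part of the answer: it lets `.` match newlines too, so a tag that spans several lines is also removed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes everything from a '<' up to the next '>'.
    // RegexOptions.Singleline is an assumption added here (not in the answer)
    // so that '.' also matches '\n' and multi-line tags are covered.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty, RegexOptions.Singleline);
    }
}
```

With this sketch, `TagStripper.StripTags("<p>Hello <b>world</b></p>")` returns `Hello world`. By contrast, `TagStripper.StripTags("<a title=\"a>b\">link</a>")` leaves `b">link` behind, because the `>` inside the attribute value ends the lazy match early. That is the kind of case the warnings in the comments are about.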

edited Apr 30 '19 at 08:26

answered Dec 01 '13 at 14:44

Aage


2

Be warned, though, that this could lead to unexpected behavior in rare cases! – marsze Dec 01 '13 at 15:15

score 3 · Answer 2 · edited May 23 '17 at 11:54


According to Ravi's answer, you can use

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

and then, to collapse runs of whitespace in the result:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
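The second statement takes the output of the first as its input, so the two are meant to run in sequence. A rough sketch of that chain follows; the names `HtmlText` and `ToPlainText` are hypothetical and only wrap the two calls from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper chaining the two replacements from the answer.
    public static string ToPlainText(string inputHTML)
    {
        // Step 1: drop anything shaped like a tag and the literal "&nbsp;"
        // entity, then trim leading and trailing whitespace.
        string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

        // Step 2: collapse each run of two or more whitespace characters
        // (spaces, tabs, newlines) into a single space.
        return Regex.Replace(noHTML, @"\s{2,}", " ");
    }
}
```

For example, `HtmlText.ToPlainText("<p>Hello&nbsp;</p>\n\n  <b>world</b> ")` returns `Hello world`. Note that `&nbsp;` is deleted rather than turned into a space, so `Hello&nbsp;world` would come out as `Helloworld`, and other entities such as `&amp;` are left untouched by this pattern.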

edited May 23 '17 at 11:54

Community


answered Dec 01 '13 at 14:52

Vignesh Kumar A

