How to remove all tags in C# using Regex.Replace

Question

I want to get the following output using Regex.Replace.

Input:

<h4 class=\"nikstyle_title\"><a rel=\"nofollow\" target=\"_blank\" href="http://www.sample.com">my text</a></h4>

Output:

<h4 class=\"nikstyle_title\"> </h4>

[What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) — Alex K., Dec 25 '14 at 14:41
[Please don't](http://stackoverflow.com/a/1732454/11683). Use a parser. — GSerg, Dec 25 '14 at 14:44
Thanks, Avinash Raj. Resolved. The correct answer is: ().*?<\/a> — MahdiAliz, Dec 25 '14 at 15:23

mybirthname · Accepted Answer · 2014-12-25T15:19:34.870

You should never use regex to parse HTML; you need an HTML parser. Here is an example of how to do it.

Add this package to your project:

Install-Package HtmlAgilityPack

The code:

        // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
        static void Main(string[] args)
        {
            string html = @"<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

<table>
    <tr>
        <td>A!!</td>
        <td>te2</td>
        <td>2!!</td>
        <td>te43</td>
        <td></td>
        <td> !!</td>
        <td>.!!</td>
        <td>te53</td>
        <td>te2</td>
        <td>texx</td>
    </tr>
</table>

<h4 class=""nikstyle_title""><a rel=""nofollow"" target=""_blank"" href=""http://www.niksalehi.com/ccount/click.php?ref=ZDNkM0xuQmxjbk5wWVc1MkxtTnZiUT09&id=117""><span class=""text-matn-title-bold-black"">my text</span></a></h4>

</body>
</html>";

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Select every <h4> whose class attribute contains "nikstyle_title".
            List<HtmlNode> h4Nodes = doc.DocumentNode.Descendants()
                .Where(x => x.Name == "h4"
                    && x.Attributes.Contains("class")
                    && x.Attributes["class"].Value.Contains("nikstyle_title"))
                .ToList();

            // Empty each matching element but keep the <h4> tag itself.
            foreach (HtmlNode node in h4Nodes)
            {
                node.InnerHtml = "";
            }

            // html2 holds the modified markup.
            string html2 = doc.DocumentNode.InnerHtml;
        }

EDIT:

For your second request, removing every <a></a> tag with `href="http://www.sample.com"`:

        static void Main(string[] args)
        {
            string html = @"<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

<table>
    <tr>
        <td>A!!</td>
        <td>te2</td>
        <td>2!!</td>
        <td>te43</td>
        <td></td>
        <td> !!</td>
        <td>.!!</td>
        <td>te53</td>
        <td>te2</td>
        <td>texx</td>

    </tr>
</table>

<h4 class=""nikstyle_title""><a rel=""nofollow"" target=""_blank"" href=""http://www.sample.com""><span class=""text-matn-title-bold-black"">my text</span></a></h4>
<div><a rel=""nofollow"" target=""_blank"" href=""http://www.sample.com""><span class=""text-matn-title-bold-black"">my text</span></a></div>
</body>
</html>";

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Select every <a> whose href attribute contains "http://www.sample.com".
            List<HtmlNode> anchorNodes = doc.DocumentNode.Descendants()
                .Where(x => x.Name == "a"
                    && x.Attributes.Contains("href")
                    && x.Attributes["href"].Value.Contains("http://www.sample.com"))
                .ToList();

            // Remove each matching <a> element together with its contents.
            foreach (HtmlNode node in anchorNodes)
            {
                node.Remove();
            }

            string html2 = doc.DocumentNode.InnerHtml;
        }

Also, I personally prefer verbatim strings (the @ prefix) for this kind of text because they are more readable, as in my example. Inside a verbatim string you escape a double quote by doubling it, for example: class=""a"".
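
To illustrate the escaping rule just mentioned, here is a minimal sketch comparing the two literal styles. The variable names are only illustrative, and the snippet assumes C# 9 or later so that it can run as top-level statements.

```csharp
using System;

// Regular string literal: each embedded double quote is escaped with a backslash.
string regular = "<h4 class=\"nikstyle_title\"> </h4>";

// Verbatim string literal (@ prefix): each embedded double quote is doubled instead.
string verbatim = @"<h4 class=""nikstyle_title""> </h4>";

// Both literals produce exactly the same characters at run time.
Console.WriteLine(regular == verbatim); // prints "True"
Console.WriteLine(verbatim);
```

Verbatim strings can also span several lines, which is why they suit the multi-line HTML sample used in the code above.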

Please read my question again. It is not just the h4 tag; it can be another tag, so I don't know the tag. I want to remove anything between — MahdiAliz, Dec 25 '14 at 14:56
Please read my question again. It is not just the h4 tag; it can be another tag, so I don't know the tag. I want to remove anything between — MahdiAliz, Dec 25 '14 at 15:00
Write your question more clearly. At the moment, my code does what you asked for. — mybirthname, Dec 25 '14 at 15:12
Thanks, mybirthname, for your answer. My problem is solved: List<HtmlNode> tdNodes = doc.DocumentNode.Descendants().Where(x => x.Name == "a" && x.Attributes.Contains("href")).ToList(); — MahdiAliz, Dec 25 '14 at 15:46

score 0 · Answer 2 · answered Dec 25 '14 at 14:54

HtmlAgilityPack is not universal. Sometimes a regex is the quickest way to get the job done. In C# you can use this code:

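    // Requires: using System.Text.RegularExpressions;
    string htmlString = ""; // put the HTML to process here
    var regex = new Regex("<h4 class=\\\"nikstyle_title\\\">(?<delete>.*?)<\\/h4>");
    // Remove only the captured span; string.Replace would throw on an empty capture.
    Group inner = regex.Match(htmlString).Groups["delete"];
    if (inner.Success) htmlString = htmlString.Remove(inner.Index, inner.Length);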

Your regex is:

<h4 class=\"nikstyle_title\">(?<delete>.*?)<\/h4>
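
Since the question asks specifically for Regex.Replace, a variation on the pattern above can be applied in a single call. This is only a sketch under some assumptions: the backslashes in the question's input are taken to be C# escape artifacts, the opening tag is matched by its exact attribute text, and nested h4 elements are not handled. It uses top-level statements (C# 9 or later), and the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<h4 class=\"nikstyle_title\"><a rel=\"nofollow\" target=\"_blank\" href=\"http://www.sample.com\">my text</a></h4>";

// Group 1 captures the opening tag and group 2 the closing tag.
// The lazy .*? between them matches the content to discard.
string pattern = "(<h4 class=\"nikstyle_title\">).*?(</h4>)";

// Keep both tags and put a single space between them, as in the question's expected output.
// RegexOptions.Singleline lets '.' match newlines, so multi-line content is covered too.
string output = Regex.Replace(input, pattern, "$1 $2", RegexOptions.Singleline);

Console.WriteLine(output); // prints: <h4 class="nikstyle_title"> </h4>
```

If the attribute order, quoting, or whitespace in the opening tag varies, this pattern will silently stop matching, which is the usual argument for using a parser instead.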

Vladislav


You should never use regex! Also, please tell me in which case HtmlAgilityPack will not save you. It handles this case pretty easily. – mybirthname Dec 25 '14 at 14:55
Some websites are built incorrectly. For example: bla bla bla some text without a tag, but I need it .... – Vladislav Dec 25 '14 at 14:59
Please read my question again. It is not just the h4 tag; it can be another tag, so I don't know the tag. I want to remove anything between – MahdiAliz Dec 25 '14 at 15:05
When the HTML is invalid, you should fix the HTML, not the method you use to read it! – mybirthname Dec 25 '14 at 15:21
Hehe, this should work if you are parsing your own website. But what if I want to parse Yahoo News, for example? – Vladislav Dec 26 '14 at 08:04
