Remove HTML with Regex

Question

Is it possible to use regex to remove HTML tags inside a particular block of HTML?

E.g.

<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          <p>My First HTML Table</p>
        </td>
    </tr>
</table>

I don't want to remove all P tags, only those within the table element.

Ideally, I'd like to be able to either remove or retain the text inside the nested p tag.

Thanks.

Inside a particular block of HTML? Sure: `s[<p>My First HTML Table</p>][My First HTML Table]`. But for any general solution, use a real HTML parser. — Quentin, Apr 18 '11 at 10:10
I must refer you to the canonical answer to any question involving HTML and regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Andrew Shepherd, Apr 18 '11 at 10:12
@Andrew - certainly my favourite answer ever - and I guess THE favourite answer on all of SE :-) — Rory Alsop, Apr 18 '11 at 10:19

score 5 · Accepted Answer · edited Nov 28 '17 at 15:17

Many people advise against using regex to parse HTML, so you could use Html Agility Pack for this instead:

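using System.IO;
using HtmlAgilityPack;

var html = @"
<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          <p>My First HTML Table</p>
        </td>
    </tr>
</table>";

HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);

// Select every <p> nested at any depth inside a <table>.
// SelectNodes returns null when nothing matches, so guard against that.
var nodes = document.DocumentNode.SelectNodes("//table//p");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // Replace the <p> with its inner content, keeping the text.
        // This assumes the paragraph holds a single node, such as plain text.
        node.ParentNode.ReplaceChild(
            HtmlNode.CreateNode(node.InnerHtml),
            node
        );
    }
}

// Serialize the modified document back to a string.
string result = null;
using (StringWriter writer = new StringWriter())
{
    document.Save(writer);
    result = writer.ToString();
}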

After these manipulations, you'll get the following result:

<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          My First HTML Table
        </td>
    </tr>
</table></body>
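
The question also asks about removing the nested paragraph together with its text. A possible variant of the loop above, sketched here under the assumption that the Html Agility Pack NuGet package is referenced, calls `HtmlNode.Remove()` instead of replacing the node. The `TableCleaner` class and `RemoveTableParagraphs` method are hypothetical names used only for illustration.

```csharp
using HtmlAgilityPack;

public static class TableCleaner
{
    // Removes every <p> that sits anywhere inside a <table>, text included.
    // Paragraphs outside a table are left untouched.
    public static string RemoveTableParagraphs(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = document.DocumentNode.SelectNodes("//table//p");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // Detach the node, and everything inside it, from its parent.
                node.Remove();
            }
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

Calling `TableCleaner.RemoveTableParagraphs(html)` with the sample from the question should leave the "Hello World!" paragraph in place and empty out the table cell.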

score 1 · Answer 2 · edited May 23 '17 at 12:18

<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>

The round brackets denote a numbered capture group which will contain your text.
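
As a minimal sketch of how that capture group could be used from C# (the language of the accepted answer), the pattern can be passed to `Regex.Replace`, with `$1` referring to the captured text. The names `TableParagraphRegex`, `Unwrap` and `Drop` are hypothetical, and the replacement strings are just one possible choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableParagraphRegex
{
    // The pattern from this answer; group 1 holds the text inside <p>...</p>.
    private static readonly Regex CellParagraph =
        new Regex(@"<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>");

    // Keeps the text and drops only the <p> wrapper.
    public static string Unwrap(string html) =>
        CellParagraph.Replace(html, "<td>$1</td>");

    // Drops the <p> element together with its text.
    public static string Drop(string html) =>
        CellParagraph.Replace(html, "<td></td>");

    public static void Main()
    {
        string html = "<table><tr><td>\n  <p>My First HTML Table</p>\n</td></tr></table>";
        Console.WriteLine(Unwrap(html)); // <table><tr><td>My First HTML Table</td></tr></table>
        Console.WriteLine(Drop(html));   // <table><tr><td></td></tr></table>
    }
}
```

Note that the whitespace around the paragraph is part of the match, so it disappears from the cell in both cases.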

However, using regular expressions in this way relies on a lot of assumptions regarding the content of the <p> tag and the construction of the HTML.
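
To make those assumptions concrete, here is a small C# sketch that runs the pattern above against a few variations of the markup. The class name `RegexAssumptions` and the sample inputs are made up for illustration, and the list of failure modes is not exhaustive.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexAssumptions
{
    private static readonly Regex CellParagraph =
        new Regex(@"<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>");

    public static bool Matches(string html) => CellParagraph.IsMatch(html);

    public static void Main()
    {
        // Plain text in a bare <p> directly inside a bare <td>: matches.
        Console.WriteLine(Matches("<td><p>My First HTML Table</p></td>"));              // True
        // Nested markup inside the paragraph: [^<]* stops at <b>, so no match.
        Console.WriteLine(Matches("<td><p>My <b>First</b> HTML Table</p></td>"));       // False
        // An attribute on the cell: the literal <td> no longer matches.
        Console.WriteLine(Matches("<td class=\"x\"><p>My First HTML Table</p></td>"));  // False
        // Upper-case tags: no match unless RegexOptions.IgnoreCase is added.
        Console.WriteLine(Matches("<TD><P>My First HTML Table</P></TD>"));              // False
        // Other content in the cell next to the paragraph: no match.
        Console.WriteLine(Matches("<td>Intro <p>My First HTML Table</p></td>"));        // False
    }
}
```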

Have a read of the ubiquitous SO question regarding using regular expressions to parse (X)HTML and see @Bruno's answer for a more robust solution.

score 1 · Answer 3 · edited May 23 '17 at 09:58

I found this link, where what seems to be the exact same question was asked:

"I have an HTML document in .txt format containing multiple tables and other texts and I am trying to delete any HTML (anything within "<>") if it's inside a table (between <table> and </table>). For example:"

Regex to delete HTML within <table> tags

answered Apr 18 '11 at 10:17 · Bruno

score 0 · Answer 4 · answered Apr 18 '11 at 10:48

Possible to some extent but not reliable!

I would rather suggest looking at an HTML parser such as Html Agility Pack.

answered Apr 18 '11 at 10:48 · VinayC
