Is it possible to use regex to remove HTML tags inside a particular block of HTML?

E.g.

<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          <p>My First HTML Table</p>
        </td>
    </tr>
</table>

I don't want to remove all P tags, only those within the table element.

Ideally, I would be able to either remove or retain the text inside the nested p tag.

Thanks.

Calum
Jamie Carruthers
  • Inside a particular block of HTML? Sure. `s[<p>My First HTML Table</p>][My First HTML Table]`, but for any general solution, use a real HTML parser.
    – Quentin Apr 18 '11 at 10:10
  • I must refer you to the canonical answer to any question involving HTML and regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Andrew Shepherd Apr 18 '11 at 10:12
  • @Andrew - certainly my favourite answer ever - and I guess THE favourite answer on all of SE :-) – Rory Alsop Apr 18 '11 at 10:19

4 Answers

Parsing HTML with regex is widely discouraged, so you could use Html Agility Pack for this instead:

using System.IO;
using HtmlAgilityPack;

var html = @"
<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          <p>My First HTML Table</p>
        </td>
    </tr>
</table>";

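HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);

// Select every <p> that is a descendant of a <table>.
// SelectNodes returns null when nothing matches, so guard the loop.
var nodes = document.DocumentNode.SelectNodes("//table//p");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // Swap the <p> element for its inner content, which keeps the text.
        // This assumes the <p> holds a single node, such as plain text.
        node.ParentNode.ReplaceChild(
            HtmlNode.CreateNode(node.InnerHtml),
            node
        );
    }
}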

// Serialize the modified document back to a string.
string result = null;
using (StringWriter writer = new StringWriter())
{
    document.Save(writer);
    result = writer.ToString();
}

After these manipulations, you get the following result:

<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          My First HTML Table
        </td>
    </tr>
</table></body>
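
The question also asks for a way to drop the text along with the tag. A minimal sketch of both variants follows, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (`TableParagraphStripper`, `StripParagraphsInTables`) and the `keepText` flag are hypothetical, chosen only for illustration. It uses `RemoveChild(node, true)` rather than the `ReplaceChild` call above, so it is a variation on this answer, not the same code.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TableParagraphStripper
{
    // keepText: true unwraps each <p> inside a table, false deletes it with its content.
    public static string StripParagraphsInTables(string html, bool keepText)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = document.DocumentNode.SelectNodes("//table//p");
        if (nodes == null)
        {
            return html;
        }

        foreach (HtmlNode node in nodes)
        {
            if (keepText)
            {
                // Remove only the <p> element; its children move up to the parent.
                node.ParentNode.RemoveChild(node, true);
            }
            else
            {
                // Remove the <p> element together with everything inside it.
                node.Remove();
            }
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

Calling `TableParagraphStripper.StripParagraphsInTables(html, keepText: false)` on the sample markup should leave the `<td>` without the paragraph or its text, while the `<p>Hello World!</p>` outside the table stays untouched.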
carla
Oleks
<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>

The round brackets denote a numbered capture group which will contain your text.

However, using regular expressions in this way relies on a lot of assumptions regarding the content of the <p> tag and the construction of the HTML.

Have a read of the ubiquitous SO question regarding using regular expressions to parse (X)HTML and see @Bruno's answer for a more robust solution.
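
To make those assumptions concrete, here is a minimal sketch in C# using only `System.Text.RegularExpressions`. It is not the pattern above but a two-step variant: first match each `<table>...</table>` block, then strip `<p>` tags inside that block only. The names `TableParagraphRegex`, `StripParagraphsInTables` and `keepText` are hypothetical. It assumes tables are not nested and that paragraphs inside them are well formed; real-world markup can easily break both assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableParagraphRegex
{
    // Matches one <table>...</table> block lazily; nested tables are NOT handled.
    private static readonly Regex TableBlock = new Regex(
        @"<table\b[^>]*>.*?</table>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches one <p>...</p> pair; group 1 holds whatever sits between the tags.
    private static readonly Regex Paragraph = new Regex(
        @"<p\b[^>]*>(.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // keepText: true unwraps the paragraph, false drops it along with its text.
    public static string StripParagraphsInTables(string html, bool keepText)
    {
        // Only text matched by TableBlock is rewritten; everything else is left as-is.
        return TableBlock.Replace(html, table =>
            Paragraph.Replace(table.Value, keepText ? "$1" : ""));
    }

    public static void Main()
    {
        string html =
            "<p>Hello World!</p><table><tr><td><p>My First HTML Table</p></td></tr></table>";

        // prints: <p>Hello World!</p><table><tr><td>My First HTML Table</td></tr></table>
        Console.WriteLine(StripParagraphsInTables(html, keepText: true));

        // prints: <p>Hello World!</p><table><tr><td></td></tr></table>
        Console.WriteLine(StripParagraphsInTables(html, keepText: false));
    }
}
```

The lazy `.*?` stops at the first `</table>`, which is why a table nested inside another table would be cut short. That is one of the cases where a parser is the safer choice.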

Community
Town
I found this link, in which it seems the exact same question was asked:

"I have an HTML document in .txt format containing multiple tables and other texts and I am trying to delete any HTML (anything within "<>") if it's inside a table (between <table> and </table>). For example:"

Regex to delete HTML within <table> tags

Community
Bruno
It is possible to some extent, but not reliable!

I would rather suggest looking at an HTML parser such as HTML Agility Pack.

VinayC