I have an HTML table in a string. Most of this HTML came from FrontPage, so it is mostly malformed. Here's a quick sample of what it looks like:
<b>Table 1</b>
<table class='class1'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p><b>Table 2</b></p>
<table class='class2'>
<tr>
<td>
<p>Procedure Name</td>
<td>
<p>Procedure</td>
</tr>
</table>
<p> Some text is here</p>
From what I understand, FrontPage automatically adds a <p> in every new cell.
I want to remove those <p> tags that are inside the tables but keep the ones outside the tables. I tried 2 methods so far:
First method
The first method was to use a single RegEx to capture every <p> tag in the tables and then use Regex.Replace() to remove them. However, I never managed to get the right RegEx for this. (I know parsing HTML with RegEx is bad. I thought the data was simple enough to apply RegEx to it.)
I can get everything in each table quite easily using this regex: <table.*?>(.*?)</table>
Then I wanted to grab only the <p> tags, so I wrote this: (?<=<table.*?>)(<p>)(?=</table>). This doesn't match anything. (Apparently .NET allows quantifiers in its lookbehinds. At least that's the impression I had while using http://regexhero.net/tester/)
Any way I can modify this RegEx to capture only what I need?
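For reference, one likely reason the pattern above matches nothing is that the lookahead (?=</table>) only succeeds when </table> follows <p> immediately, which never happens in the sample. Below is a minimal sketch of a lookbehind-only variant. The class and method names are hypothetical, and the pattern assumes tables are not nested and that the tags are plain <p> with no attributes. It relies on .NET's support for variable-length lookbehinds, which the question already observed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class LookbehindSketch
{
    // Matches a <p> only when some <table ...> opening tag precedes it
    // with no </table> in between, i.e. the <p> sits inside an open table.
    // Assumption: tables are not nested and <p> carries no attributes.
    private static readonly Regex ParagraphInTable = new Regex(
        @"(?<=<table\b[^>]*>(?:(?!</table>).)*)<p>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveParagraphsInTables(string htmlSource)
    {
        // Replace every matching <p> with nothing; text outside tables is untouched.
        return ParagraphInTable.Replace(htmlSource, string.Empty);
    }
}
```

Because the lookbehind rescans backwards from every <p>, this may be slow on very large pages.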
Second method
The second method was to capture only the table contents into a string and then use String.Replace() to remove the <p> tags. I'm using the following code to capture the matches:
MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);
htmlSource is a string containing the whole HTML page, and this variable is what will be sent back to the client after processing. I want to remove only what I need to remove from htmlSource.
How can I use the MatchCollection to remove the <p> tags and then send the updated tables back to htmlSource?
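To make the second method concrete, here is a rough sketch of one way it could go. Instead of walking the MatchCollection and splicing strings back by index, it uses the Regex.Replace overload that takes a MatchEvaluator, so each matched table is rewritten in place. The names TableCleaner and StripParagraphsInTables are hypothetical, and the sketch assumes non-nested tables and attribute-free <p> tags.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class TableCleaner
{
    // Same idea as the pattern in the question: one match per <table>...</table>.
    // Assumption: tables are not nested (the lazy .*? stops at the first </table>).
    private static readonly Regex TableRegex = new Regex(
        @"<table\b[^>]*>.*?</table>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string StripParagraphsInTables(string htmlSource)
    {
        // The lambda runs once per table; its return value replaces that table
        // in the result, so no manual index bookkeeping is needed.
        return TableRegex.Replace(
            htmlSource,
            m => Regex.Replace(m.Value, "<p>", string.Empty, RegexOptions.IgnoreCase));
    }
}
```

Usage would then be along the lines of htmlSource = TableCleaner.StripParagraphsInTables(htmlSource); before sending the page back to the client.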
Thank you
` tags, but replacing them might not be able to be done this way.
– Jun 08 '15 at 17:50