Remove parts of Regex.Match string

Question

So I have an HTML table in a string. Most of this HTML came from FrontPage so it is mostly badly formatted. Here's a quick sample of what it looks like.

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>

From what I understand, FrontPage automatically adds a <p> in every new cell.

I want to remove those <p> tags that are inside the tables but keep the ones outside the tables. I tried 2 methods so far:

First method

First method was to use a single RegEx to capture every <p> tag in the tables and then use Regex.Replace() to remove them. However, I never managed to get the right RegEx for this. (I know parsing HTML with RegEx is bad. I thought the data was simple enough to apply RegEx to it.)

I can get everything in each table quite easily using this regex: <table.*?>(.*?)</table>

Then I wanted to only grab the <p> tags so I wrote this: (?<=<table.*?>)(<p>)(?=</table>). This doesn't match anything. (Apparently .NET allows quantifiers in their lookbehinds. At least that's the impression I had while using http://regexhero.net/tester/)

Any way I can modify this RegEx to capture only what I need?

Second method

Second method was to capture only the table contents into a string and then use String.Replace() to remove the <p> tags. I'm using the following code to capture the matches:

MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);

htmlSource is a string containing the whole HTML page and this variable is what will be sent back to the client after processing. I want to remove only what I need to remove from htmlSource.

How can I use the MatchCollection to remove the <p> tags and then send the updated tables back to htmlSource?

Thank you

It's generally perceived to be [bad practice to try to parse HTML with regex](http://stackoverflow.com/a/1732454/791010), but HTML generated by Frontpage? That's a whole new level... — James Thorpe, Jun 08 '15 at 15:47
@JamesThorpe I guess an HTML parser won't be able to read invalid HTML like this, so maybe there is no other option. — Alex Zhukovskiy, Jun 08 '15 at 15:48
@Alex A parser stands a much better chance of dealing with it than a regex probably ever will... Also, I don't see anything particularly invalid with what the OP has posted? — James Thorpe, Jun 08 '15 at 15:49
@JamesThorpe I agree that a parser is the best option in most cases, but a common parser just throws an exception in such cases. — Alex Zhukovskiy, Jun 08 '15 at 15:51
You could use `MatchCollection` to find all the inside `<p>` tags, but replacing them might not be able to be done this way. — , Jun 08 '15 at 17:50

score 1 · Accepted Answer · answered Jun 08 '15 at 17:59

This answer is based on the second approach from the question. I changed the regex so that it matches each whole table, opening and closing tags included:

<table.*?table>

Then I used Regex.Replace with a MatchEvaluator, so that the <p> tags are removed only inside each matched table:

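// Singleline lets '.' match newlines, so one match can span a whole table.
Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
// The MatchEvaluator lambda runs once per table and strips <p> from that table only.
string replaced = myRegex.Replace(htmlSource, m => m.Value.Replace("<p>", ""));
Console.WriteLine(replaced);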

Output using question input:

<b>Table 1</b>
    <table class='class1'>
    <tr>
    <td>
        Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
    </table>
<p><b>Table 2</b></p>
    <table class='class2'>
    <tr>
        <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
    </table>
<p> Some text is here</p>
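
A possible variation on the same idea, offered as a sketch rather than as part of the original answer: the plain string replace only removes the exact text <p>, so it would miss <P>, <p class="..."> or a closing </p> inside a table. Assuming the tables are never nested, a second regex can strip those forms as well. The names TableParagraphStripper and StripParagraphTags are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableParagraphStripper
{
    // Matches one table, from its opening tag to the first closing tag.
    // Assumption: tables are not nested inside each other.
    private static readonly Regex TableRegex = new Regex(
        @"<table\b[^>]*>.*?</table\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches <p>, <P class='x'> and </p>, but not <pre> or <param>.
    private static readonly Regex ParagraphTagRegex = new Regex(
        @"</?p\b[^>]*>",
        RegexOptions.IgnoreCase);

    public static string StripParagraphTags(string html)
    {
        // The lambda is called once per table; text outside tables is untouched.
        return TableRegex.Replace(html, m => ParagraphTagRegex.Replace(m.Value, ""));
    }

    public static void Main()
    {
        string html = "<p>Keep me</p><table><tr><td><P class='x'>Cell</p></td></tr></table>";
        Console.WriteLine(StripParagraphTags(html));
        // prints: <p>Keep me</p><table><tr><td>Cell</td></tr></table>
    }
}
```

Whether the closing </p> tags should go too depends on the data; if only the opening tags are unwanted, the pattern can be narrowed to <p\b[^>]*>.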

score 1 · Answer 2 · answered Jun 08 '15 at 18:11

I guess by using a delegate (callback) it could be done.

string html = @"
<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
";

Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );

string htmlNew = RxTable.Replace( 
    html,
    delegate(Match match)
    {
       return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
    }
);
Console.WriteLine( htmlNew );

Output:

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
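
For the first method in the question (a single regex), one likely reason the pattern never matched is that the lookahead (?=</table>) only succeeds when </table> starts immediately after the <p>. A hedged alternative is a lookahead that scans forward to the next </table> while refusing to step over a new <table. This is only a sketch: it assumes tables are not nested and that every table is closed, and the names SingleRegexDemo and RemoveParagraphsInsideTables are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SingleRegexDemo
{
    // <p> followed, at some distance, by </table>, with no opening <table
    // in between. A <p> outside a table always meets <table (or the end of
    // the input) before any </table>, so the lookahead fails for it.
    private static readonly Regex ParagraphInsideTable = new Regex(
        @"<p>(?=(?:(?!<table\b).)*?</table\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveParagraphsInsideTables(string html)
    {
        return ParagraphInsideTable.Replace(html, "");
    }

    public static void Main()
    {
        string html = "<p>Intro</p><table><tr><td><p>Cell</td></tr></table><p>Outro</p>";
        Console.WriteLine(RemoveParagraphsInsideTables(html));
        // prints: <p>Intro</p><table><tr><td>Cell</td></tr></table><p>Outro</p>
    }
}
```

The lookahead rescans the rest of the table for every <p>, so on very large pages the callback approach above is probably the cheaper choice.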

score 0 · Answer 3 · answered Jun 08 '15 at 16:00

.NET regex lets you work with nested structures by using balancing groups. It's very ugly and you should avoid it, but if you have no other option, you can use it.

static void Main()
{
    string s = 
@"A()
{
    for()
    {
    }
    do
    {
    }
}
B()
{
    for()
    {
    }   
}
C()
{
    for()
    {
        for()
        {
        }
    }   
}";

    var r = new Regex(@"  
                      {                       
                          (                 
                              [^{}]           # everything except braces { }   
                              |
                              (?<open>  { )   # if { then push
                              |
                              (?<-open> } )   # if } then pop
                          )+
                          (?(open)(?!))       # fail the match if the stack is not empty
                      }                                                                  

                    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    int counter = 0;

    foreach (Match m in r.Matches(s))
        Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);

    Console.Read();
}

Here the regex "knows" where a block starts and where it ends, so you can use this information to remove a <p> tag if it doesn't have a matching closing tag.

My main issue is not with handling the `<p>` tags that lack matching closing tags, because I simply want to remove them, even if they have a matching closing tag. My issue is that I can't match or remove only the tags that are _inside_ a table, whether or not they have matching closing tags. — Joeh Perron, Jun 08 '15 at 16:45
