string htmlInput = @"<table>
    <tr>
        <td>Text A</td>
    </tr>
    <tr>
        <td>
            <table>  <!-- Notice this is an inner scope table -->
                <tr>
                    <td>Text B</td>
                </tr>
            </table>
        </td>
    </tr>
</table>

<table>
    <tr>
        <td>
            <table> <!-- Notice this is an inner scope table -->
                <tr>
                    <td>Text C</td>
                </tr>
            </table>
        </td>
    </tr>
</table>

<table>
    <tr>
        <td>Text D</td>
    </tr>
</table>";

I have a series of tables in the above string format.

I want to extract the content of every first-level (outermost) <tr>. The expected output is:

Text A

<table>
    <tr>
        <td>Text B</td>
    </tr>
</table>

<table>
    <tr>
        <td>Text C</td>
    </tr>
</table>

Text D

I have the following regex, which describes what I am trying to do:

var regexTableRow = new Regex("<tr><td>(.*?)</td></tr>");

var regexMatches = regexTableRow.Matches(htmlInput);

var tableRows = new List<string>();

foreach (Match match in regexMatches)
{
    // Get one <tr>...</tr> row out
    var value = match.Value;

    tableRows.Add(value);
}

The regex fails because it extracts the <tr> elements of the inner tables instead of the outer tables. How can I make the regex match only the rows of the outer tables?

Thanks.

[Edit] - Thank you, I will use HtmlAgilityPack instead. However, I am facing a similar issue with this code:

var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlInput);

var output = htmlDocument.DocumentNode
    .SelectNodes("table/tr");

Here, too, the rows of the inner tables are picked up instead of those of the outer tables.
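
One parser-based way to keep only the outer rows is to walk direct children instead of searching all descendants. The sketch below is not HtmlAgilityPack: it swaps in System.Xml.Linq from the standard library, and it assumes the input is well-formed XML (true for the sample above, often false for real-world HTML). The class name OuterRowsXml and the wrapping <root> element are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class OuterRowsXml
{
    // Returns the content of each <td> in the rows of top-level tables only.
    public static List<string> Extract(string html)
    {
        // Assumption: the markup is well-formed XML. A hypothetical <root>
        // wrapper lets several top-level tables form a single document.
        var root = XElement.Parse("<root>" + html + "</root>");

        return root
            .Elements("table")   // direct children of root: outer tables only
            .Elements("tr")      // direct children of those tables: outer rows
            .Elements("td")      // cells of the outer rows
            .Select(td => td.HasElements
                // Cell holds nested markup: return the child elements as text.
                ? string.Concat(td.Elements().Select(e => e.ToString()))
                // Cell holds plain text: return it without surrounding whitespace.
                : td.Value.Trim())
            .ToList();
    }
}
```

Because Elements (unlike Descendants) never looks below one level, the rows of the nested tables are not visited on their own; they only appear as part of the outer cell that contains them. On the sample input, OuterRowsXml.Extract(htmlInput) should yield four entries: two plain-text cells and two nested tables.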

taylorswiftfan
  • Regex is not the right tool for this. Use the [HtmlAgilityPack](https://www.nuget.org/packages/HtmlAgilityPack/). – Olivier Jacot-Descombes Oct 13 '19 at 21:03
  • Use HtmlAgilityPack to parse HTML and extract relevant nodes. Regex is too hardcore for those tasks. – eocron Oct 13 '19 at 21:03
  • As mentioned, regex is not the appropriate means. But for interest, you'd need [balancing groups](https://weblogs.asp.net/whaggard/377025), something like [`<tr>\s*<td>\s*((?><tr>(?<c>)|<(?!/?tr)|[^<]+|</tr>(?<-c>))*(?(c)(?!)))\s*</td>\s*</tr>`](http://regexstorm.net/tester?p=%3ctr%3e%5cs*%3ctd%3e%5cs*%28%28%3f%3e%3ctr%3e%28%3f%3cc%3e%29%7c%3c%28%3f!%2f%3ftr%29%7c%5b%5e%3c%5d%2b%7c%3c%2ftr%3e%28%3f%3c-c%3e%29%29*%28%3f%28c%29%28%3f!%29%29%29%5cs*%3c%2ftd%3e%5cs*%3c%2ftr%3e&i=%3ctable%3e%0d%0a++++%3ctr%3e%0d%0a++++++++%3ctd%3eText+D%3c%2ftd%3e%0d%0a++++%3c%2ftr%3e%0d%0a%3c%2ftable), with the content in **group 1** – bobble bubble Oct 13 '19 at 22:28
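
For interest, the balancing-group pattern from the last comment (decoded from its regexstorm link) can be tried out as follows. This is a minimal sketch, not a recommendation: the class name OuterRowRegex is hypothetical, and the pattern assumes lowercase <tr> and <td> tags without attributes and exactly one cell per row.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OuterRowRegex
{
    // Each nested <tr> pushes onto the capture stack "c" and each nested </tr>
    // pops it; (?(c)(?!)) fails the match if any <tr> is still open, so the
    // final </td></tr> always belongs to the row where the match started.
    private static readonly Regex OuterRow = new Regex(
        @"<tr>\s*<td>\s*((?><tr>(?<c>)|<(?!/?tr)|[^<]+|</tr>(?<-c>))*(?(c)(?!)))\s*</td>\s*</tr>");

    // Returns the trimmed content of group 1 for every outer row.
    public static List<string> Extract(string html)
    {
        var cells = new List<string>();
        foreach (Match match in OuterRow.Matches(html))
        {
            // Group 1 is the cell content, including any nested table markup.
            cells.Add(match.Groups[1].Value.Trim());
        }
        return cells;
    }
}
```

Matches are found left to right without overlapping, so an outer row that contains a nested table consumes the inner rows as part of its own match and they are not reported separately.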

1 Answer


It is frowned upon to do this with regular expressions, yet if you have to, you might define some boundaries, such as:

(?<=<table>)\s*<tr>\s*<td>([a-z0-9 ]*)<\/td>\s*<\/tr>

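Note that this only matches a row that directly follows an opening <table> tag and whose cell holds nothing but letters, digits, and spaces; handling arbitrary nesting this way would become pretty complicated.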

Test

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"(?<=<table>)\s*<tr>\s*<td>([a-z0-9 ]*)<\/td>\s*<\/tr>";
        string input = @"<table>
    <tr>
        <td>Text A</td>
    </tr>
    <tr>
        <td>
            <table>  <!-- Notice this is an inner scope table -->
                <tr>
                    <td>Text B</td>
                </tr>
            </table>
        </td>
    </tr>
</table>

<table>
    <tr>
        <td>
            <table> <!-- Notice this is an inner scope table -->
                <tr>
                    <td>Text C</td>
                </tr>
            </table>
        </td>
    </tr>
</table>

<table>
    <tr>
        <td>Text D</td>
    </tr>
</table>";
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against sample inputs.


Emma