string htmlInput = @"<table>
<tr>
<td>Text A</td>
</tr>
<tr>
<td>
<table> <!-- Notice this is an inner scope table -->
<tr>
<td>Text B</td>
</tr>
</table>
</td>
</tr>
</table>
<table>
<tr>
<td>
<table> <!-- Notice this is an inner scope table -->
<tr>
<td>Text C</td>
</tr>
</table>
</td>
</tr>
</table>
<table>
<tr>
<td>Text D</td>
</tr>
</table>";
I have a series of tables in the string format above.
I want to extract the content of every first-level (outermost) <tr>, ignoring the rows of any nested table.
The expected extracted content is:
Text A
<table>
<tr>
<td>Text B</td>
</tr>
</table>
<table>
<tr>
<td>Text C</td>
</tr>
</table>
Text D
I have the following Regex, which describes what I am trying to do:
var regexTableRow = new Regex("<tr><td>(.*?)</td></tr>");
var regexMatches = regexTableRow.Matches(htmlInput);
var tableRows = new List<string>();
foreach (Match match in regexMatches)
{
    // Get a row of <tr></tr> out
    var value = match.Value;
    tableRows.Add(value);
}
The Regex fails because it extracts the <tr> from the inner tables instead of from the outer tables. How can I make the Regex match only the rows of the outer tables?
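To make the failure concrete, here is a minimal reproduction. It is only a sketch under some assumptions: the class name `Repro` and the variable names are made up for illustration, and the input is a compacted copy of the second row of the first table, with the whitespace between tags removed so that the pattern above can match it.

```csharp
using System;
using System.Text.RegularExpressions;

class Repro
{
    static void Main()
    {
        // Assumption: compacted copy of the nested row, with no whitespace between tags.
        string html = "<table><tr><td><table><tr><td>Text B</td></tr></table></td></tr></table>";

        // Same pattern as in the question.
        var regexTableRow = new Regex("<tr><td>(.*?)</td></tr>");

        foreach (Match match in regexTableRow.Matches(html))
        {
            // Print only the captured cell content.
            Console.WriteLine(match.Groups[1].Value);
        }
    }
}
```

This prints a single line, `<table><tr><td>Text B`. The match starts at the outer <tr>, but the lazy `(.*?)` stops at the first `</td></tr>`, which closes the inner row, so the capture is cut off inside the nested table instead of containing the whole inner <table>...</table>.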
Thanks.
[Edit] Thank you, I will use HtmlAgilityPack instead. However, I am facing a similar issue with this code:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlInput);
var output = htmlDocument.DocumentNode
    .SelectNodes("table/tr");
Here too, the rows of the inner tables are being picked up instead of the rows of the outer tables.