I'm writing a C# console application to retrieve table info from an external HTML web page.
Example web page: (chessnuts.org)
I want to extract all <td> records for data, match, opponent, result, etc. (23 rows in the example link above).
I have no control over this web page, which unfortunately isn't well formatted, so options I've tried, such as HtmlAgilityPack and XML parsing, simply fail. I have also tried a number of regexes, but my knowledge of them is extremely poor. An example I tried is below:
    string[] trs = Regex.Matches(html,
            @"<tr[^>]*>(?<content>.*)</tr>",
            RegexOptions.Multiline)
        .Cast<Match>()
        .Select(t => t.Groups["content"].Value)
        .ToArray();
This returns a complete list of all <tr>s (including many records I don't need), but I'm then unable to get the data out of it.
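For reference, the result I'm after could be sketched roughly as follows. This is only a minimal sketch, assuming every row and cell on the page has a matching closing tag and that tables are not nested, neither of which I have verified for this page. The names `TableScraper` and `ExtractRows` are made up for illustration. Compared with the attempt above, it uses lazy quantifiers (`.*?`) and `RegexOptions.Singleline` so that `.` also matches line breaks, then runs a second pattern over each row to pull out the cells.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TableScraper
{
    // Hypothetical helper: returns one string[] per <tr>, holding the
    // trimmed text of each <td> in that row. Assumes closing tags exist.
    public static List<string[]> ExtractRows(string html)
    {
        var rows = new List<string[]>();
        // Singleline lets '.' match newlines; IgnoreCase handles <TR>/<TD>.
        var options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        foreach (Match tr in Regex.Matches(html, @"<tr\b[^>]*>(?<content>.*?)</tr>", options))
        {
            string[] cells = Regex
                .Matches(tr.Groups["content"].Value, @"<td\b[^>]*>(?<cell>.*?)</td>", options)
                .Cast<Match>()
                // Strip any markup left inside the cell, e.g. <b> or <a>.
                .Select(td => Regex.Replace(td.Groups["cell"].Value, "<[^>]+>", "").Trim())
                .ToArray();

            // Rows without any <td> (for example header rows using <th>) are skipped.
            if (cells.Length > 0)
            {
                rows.Add(cells);
            }
        }

        return rows;
    }
}
```

Each returned array would then hold one row's cells in document order, so the unwanted rows could be filtered afterwards, for example by cell count.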
UPDATE
Here is an example of the HtmlAgilityPack approach I tried:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
    {
        foreach (HtmlNode row in table.SelectNodes("tr"))
        {
            foreach (HtmlNode cell in row.SelectNodes("td"))
            {
                Console.WriteLine(cell.InnerText);
            }
        }
    }