Parse badly formatted HTML for table data

Question

I'm writing a C# console application to retrieve table data from an external HTML web page.

I want to extract all <td> records for data, match, opponent, result, etc. (23 rows in the example link above).

I have no control over this web page, which unfortunately isn't well formatted, so the options I've tried, such as HtmlAgilityPack and XML parsing, simply fail. I have also tried a number of regular expressions, but my knowledge of them is extremely poor. Here is one example I tried:

string[] trs = Regex.Matches(html, 
                             @"<tr[^>]*>(?<content>.*)</tr>", 
                             RegexOptions.Multiline)
                    .Cast<Match>()
                    .Select(t => t.Groups["content"].Value)
                    .ToArray();

This returns a complete list of all <tr> elements (including many records I don't need), but I'm then unable to extract the data from them.
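For reference, two details of the pattern above matter. `RegexOptions.Multiline` only changes how `^` and `$` behave, so `.` still does not match a newline, and the greedy `.*` runs to the last `</tr>` on a line rather than the first. The following is a minimal sketch of a two-pass regex approach, assuming the rows and cells are not nested. The class and method names (`RegexRowSketch`, `ExtractRows`) are hypothetical, and regexes remain brittle against real-world HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original question.
public static class RegexRowSketch
{
    // Singleline lets '.' match newlines; '.*?' is lazy, so each match
    // stops at the first closing tag. '\b' avoids matching tags like <track>.
    private static readonly Regex RowPattern = new Regex(
        @"<tr\b[^>]*>(?<content>.*?)</tr>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    private static readonly Regex CellPattern = new Regex(
        @"<td\b[^>]*>(?<content>.*?)</td>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns one list of cell texts per <tr> that contains at least one <td>.
    public static List<List<string>> ExtractRows(string html)
    {
        var rows = new List<List<string>>();
        foreach (Match row in RowPattern.Matches(html))
        {
            var cells = CellPattern.Matches(row.Groups["content"].Value)
                .Cast<Match>()
                // Strip any tags left inside the cell and trim whitespace.
                .Select(c => Regex.Replace(c.Groups["content"].Value, "<[^>]+>", "").Trim())
                .ToList();

            if (cells.Count > 0)
            {
                rows.Add(cells);
            }
        }
        return rows;
    }
}
```

Calling `RegexRowSketch.ExtractRows(html)` gives the rows as lists of strings, which can then be filtered down to the rows of interest. Nested tables or unclosed `<tr>`/`<td>` tags will defeat this sketch, which is why an HTML parser is usually the safer choice.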

UPDATE

Here is an example of the use of HtmlAgilityPack I tried:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
    foreach (HtmlNode row in table.SelectNodes("tr"))
    {
        foreach (HtmlNode cell in row.SelectNodes("td"))
        {
            Console.WriteLine(cell.InnerText);
        }
    }
}

Try looking at [this SO Question](http://stackoverflow.com/questions/14987878/html-agility-pack-parse-table) — Icemanind, Oct 02 '14 at 22:32
Like I mentioned in my question, using the Html Agility Pack fails because the page has missing closing meta tags. — Matt D. Webb, Oct 02 '14 at 22:34
If it has a specific limited problem with the meta tags, why not do some html.Replace("malformed meta","better meta") and fix them? — MatthewMartin, Oct 02 '14 at 22:37
@MatthewMartin Assuming I have interpreted your question correctly I'd have to know where these missing tags were. I'm hoping to run this console app on many similar (but not exact) pages. — Matt D. Webb, Oct 02 '14 at 22:39
Isn't it a tautology that a text string that perforce fails parsing as a context free grammar **cannot be a regular expression**? — Pieter Geerkens, Oct 02 '14 at 22:41
I'm not sure why you can't use HtmlAgilityPack.. it should be able to extract what you're after on that page perfectly fine. Let me check ... — Simon Whitehead, Oct 02 '14 at 22:44
Looking at your html code from your sample page, I see nothing wrong with your meta tags that would make Html Agility Pack not be able to parse your html. Even if your HTML is malformed, you could use something like [Tidy.NET](http://tidynet.sourceforge.net/) to make it pretty, then use HTML agility pack — Icemanind, Oct 02 '14 at 22:45
@PieterGeerkens I think the issue here is that the page does not conform to the HTML grammar... your comment is valid only if you assume that the page does not conform to any context-free grammar. — SJuan76, Oct 02 '14 at 22:49
I looked at the page, and the meta tags look like static strings. Replace them with a zero-length string and then you can use Html Agility Pack. — MatthewMartin, Oct 02 '14 at 22:50
I have updated my question with some HTML agility pack code. — Matt D. Webb, Oct 02 '14 at 22:54
@Webb Can you explain how the HtmlAgilityPack example you showed doesn't fit your needs? It seems to do what you were hoping the regex does.. — Simon Whitehead, Oct 02 '14 at 22:55
@Simon Whitehead - yes, this code fails on the third iteration of the outer (table) foreach loop. — Matt D. Webb, Oct 02 '14 at 23:04
@Webb I think you just need to fix it up. I've provided an answer that works for me. — Simon Whitehead, Oct 02 '14 at 23:11

score 1 · Accepted Answer · answered Oct 02 '14 at 23:10

I think you just need to fix your HtmlAgilityPack attempt. This works fine for me:

// Skip the first table on that page so we only get the results table
foreach (var table in doc.DocumentNode.SelectNodes("//table").Skip(1).Take(1)) {
    // ".//td" is relative to this table; "//td" would match every cell in the document
    foreach (var td in table.SelectNodes(".//td")) {
        Console.WriteLine(td.InnerText);
    }
}

This dumps the data from the results table to the console, one cell per line.
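If the cells need to stay grouped by row, a variant along the following lines may help. It is a sketch, not the answerer's code: the names `TableSketch` and `ReadTable` and the `tableIndex` parameter are hypothetical. It relies on two HtmlAgilityPack behaviours: `SelectNodes` returns null rather than an empty collection when nothing matches (a likely cause of a loop failing partway through), and an XPath starting with `//` searches the whole document, whereas `.//` or `./` is relative to the current node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; requires the HtmlAgilityPack package.
public static class TableSketch
{
    // Returns the <td> texts of the table at tableIndex, one list per row.
    public static List<List<string>> ReadTable(string html, int tableIndex)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();

        // SelectNodes returns null when there are no matches.
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null || tableIndex < 0 || tableIndex >= tables.Count)
        {
            return rows;
        }

        // ".//tr" is scoped to the chosen table (this includes rows of nested tables).
        var trs = tables[tableIndex].SelectNodes(".//tr");
        if (trs == null)
        {
            return rows;
        }

        foreach (var tr in trs)
        {
            // "./td" selects only the direct <td> children of this row.
            var tds = tr.SelectNodes("./td");
            if (tds == null)
            {
                continue; // e.g. a header row that only contains <th>
            }

            rows.Add(tds
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList());
        }
        return rows;
    }
}
```

For the page in the question, `TableSketch.ReadTable(html, 1)` would correspond to the `Skip(1).Take(1)` selection above, assuming the results table really is the second `<table>` on every page being scraped.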

score 0 · Answer 2 · answered Feb 23 '16 at 11:49

If you want a full program :) I looked for this for hours.

using System;
using System.IO;
using System.Linq;
using System.Windows.Forms;

class ReadHTML
{
    internal void ReadText()
    {
        try
        {
            FolderBrowserDialog fbd = new FolderBrowserDialog();
            fbd.RootFolder = Environment.SpecialFolder.MyComputer; // start browsing from "My Computer"
            if (fbd.ShowDialog() == DialogResult.OK)
            {
                // Change the search pattern to target a different file type
                string[] files = Directory.GetFiles(fbd.SelectedPath, "*.html", SearchOption.AllDirectories);
                SaveFileDialog sfd = new SaveFileDialog(); // choose where to save the output text file
                //sfd.Filter = "Text File|*.txt"; // uncomment to show text files only
                sfd.FileName = "Html Output.txt";
                sfd.Title = "Save Text File";
                if (sfd.ShowDialog() == DialogResult.OK)
                {
                    string path = sfd.FileName;
                    // The using block flushes and closes the writer on exit
                    using (StreamWriter bw = new StreamWriter(File.Create(path)))
                    {
                        foreach (string f in files)
                        {
                            var html = new HtmlAgilityPack.HtmlDocument();
                            html.Load(f);

                            // SelectNodes returns null when nothing matches, so skip such files
                            var tables = html.DocumentNode.SelectNodes("//table");
                            if (tables == null) continue;

                            foreach (var table in tables.Skip(1).Take(1)) // the second table on the page
                            {
                                var cells = table.SelectNodes(".//td"); // cells inside this table only
                                if (cells == null) continue;

                                foreach (var td in cells)
                                {
                                    bw.WriteLine(td.InnerText); // one cell per line in the output file
                                }
                            }
                        } // end of file loop
                    }
                }
                MessageBox.Show("Files found: " + files.Length);
            }
        }
        catch (UnauthorizedAccessException UAEx)
        {
            MessageBox.Show(UAEx.Message);
        }
        catch (PathTooLongException PathEx)
        {
            MessageBox.Show(PathEx.Message);
        }
    } // method ends
}
