HTML Agility Pack

Question

I have HTML tables in a web page, like this:

<table border=1>
    <tr><td>sno</td><td>sname</td></tr>
    <tr><td>111</td><td>abcde</td></tr>
    <tr><td>213</td><td>ejkll</td></tr>
</table>

<table border=1>
    <tr><td>adress</td><td>phoneno</td><td>note</td></tr>
    <tr><td>asdlkj</td><td>121510</td><td>none</td></tr>
    <tr><td>asdlkj</td><td>214545</td><td>none</td></tr>
</table>

Using HTML Agility Pack, I want to extract only the data in the address and phone number columns from this page. That means I first have to find which table contains the adress and phoneno columns, and then extract the data from those two columns. How should I do this?

I can get the table, but I don't understand what to do after that.

One more thing: is it feasible to extract data from a table by column name?

Duplicate of http://stackoverflow.com/questions/2422762/html-agility-pack — Mike Two, Mar 12 '10 at 10:31
@Harikrishna - that is the same question you asked yesterday. You do state that requirement in the question from yesterday. This is still a duplicate question. It really is easier if you make the original question clearer instead of adding new questions. You are more likely to get the answer you want. — Mike Two, Mar 12 '10 at 10:44
@Harikrishna - I understand your issue with the answer I gave and I don't disagree with that, but asking the same question again is not the way to get a better answer. — Mike Two, Mar 12 '10 at 11:48
@Mike Two Sir..Thank You Very Much Sir for helping for my previous question. — Harikrishna, Mar 12 '10 at 11:55
@Harikrishna - You are welcome, hopefully it was somewhat helpful. Good luck with your project. — Mike Two, Mar 12 '10 at 13:29

João Angelo · Accepted Answer · 2010-03-12T11:09:50.453

Here are some helper methods that parse HTML tables into DataTable instances. You can iterate through the resulting DataTable array to find the one containing the columns you want. The code is coupled to the format of the tables in your HTML: it takes the column names from the cells of the first row (<tr>). Also note that no error checking is performed, so this will break with tables that do not follow the format you specified.

Helper methods:

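// Requires: using System.Collections.Generic; using System.Data;
//           using System.Linq; using HtmlAgilityPack;
private static DataTable[] ParseAllTables(HtmlDocument doc)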
{
    var result = new List<DataTable>();
    foreach (var table in doc.DocumentNode.Descendants("table"))
    {
        result.Add(ParseTable(table));
    }
    return result.ToArray();
}

private static DataTable ParseTable(HtmlNode table)
{
    var result = new DataTable();

    var rows = table.Descendants("tr");

    // The first <tr> supplies the column names.
    var header = rows.First();
    foreach (var column in header.Descendants("td"))
    {
        result.Columns.Add(new DataColumn(column.InnerText, typeof(string)));
    }

    foreach (var row in rows.Skip(1))
    {
        var data = new List<string>();
        foreach (var column in row.Descendants("td"))
        {
            data.Add(column.InnerText);
        }
        result.Rows.Add(data.ToArray());
    }
    return result;
}

Usage example:

public static void Main(string[] args)
{
    string html = @"
        <html><head></head>
        <body><div>
            <table border=1>
                <tr><td>sno</td><td>sname</td></tr>
                <tr><td>111</td><td>abcde</td></tr>
                <tr><td>213</td><td>ejkll</td></tr>
            </table>
            <table border=1>
                <tr><td>adress</td><td>phoneno</td><td>note</td></tr>
                <tr><td>asdlkj</td><td>121510</td><td>none</td></tr>
                <tr><td>asdlkj</td><td>214545</td><td>none</td></tr>
            </table>
        </div></body>
        </html>";

    HtmlDocument doc = new HtmlDocument();

    doc.LoadHtml(html);

    DataTable addressAndPhones = null;
    foreach (var table in ParseAllTables(doc))
    {
        if (table.Columns.Contains("phoneno") && table.Columns.Contains("adress"))
        {
            // You found the address and phone number table
            addressAndPhones = table;
            break;
        }
    }
}
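
Once the matching table has been found, its cells can be read by column name through the `DataRow` indexer, which covers the "extract by column name" part of the question. The following is only a minimal sketch: `ColumnReader` and `ReadColumn` are hypothetical names that are not part of the answer above, and the hand-built `DataTable` stands in for the one `ParseTable` would return for the second HTML table. The header text "adress" is spelled as it appears in the source HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class ColumnReader
{
    // Returns the values of one named column as strings,
    // or an empty list when the table is null or lacks that column.
    public static List<string> ReadColumn(DataTable table, string columnName)
    {
        var values = new List<string>();
        if (table == null || !table.Columns.Contains(columnName))
        {
            return values;
        }
        foreach (DataRow row in table.Rows)
        {
            values.Add(Convert.ToString(row[columnName]));
        }
        return values;
    }

    public static void Main()
    {
        // Stand-in for the DataTable parsed from the second HTML table.
        var table = new DataTable();
        table.Columns.Add("adress", typeof(string));
        table.Columns.Add("phoneno", typeof(string));
        table.Columns.Add("note", typeof(string));
        table.Rows.Add("asdlkj", "121510", "none");
        table.Rows.Add("asdlkj", "214545", "none");

        List<string> addresses = ReadColumn(table, "adress");
        List<string> phones = ReadColumn(table, "phoneno");
        for (int i = 0; i < phones.Count; i++)
        {
            Console.WriteLine(addresses[i] + " | " + phones[i]);
        }
    }
}
```

Returning an empty list for a missing column is a design choice made for this sketch; throwing an exception may suit you better if a missing column means the page layout has changed.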

@Harikrishna, `Skip` and `Take` are defined in `System.Linq`. You need to add a using statement for that namespace. LINQ is not available in .NET 2.0. — João Angelo, Mar 12 '10 at 11:29
@Harikrishna, as I said the helper functions are highly coupled to a given HTML format. They work for the following example. If you have different inputs you'll have to adapt them to your needs. — João Angelo, Mar 12 '10 at 11:41
@Harikrishna, refer to point 3.10 of http://www.codeproject.com/KB/grid/practicalguidedatagrids2.aspx — João Angelo, Mar 12 '10 at 11:49
@Joao Angelo, when a table row has no closing tag ("/tr"), the information is not parsed correctly. What should I do in that case? For example, a row starts with an opening tr tag and the next row starts with a new opening tr tag, with no closing tag in between. Is there any option in HTML Agility Pack that can clean the HTML page first and then parse it? — Harikrishna, Mar 22 '10 at 12:09
@Joao Angelo, please refer to this question of mine: http://stackoverflow.com/questions/2490765/which-is-the-best-html-tidy-pack-is-there-any-option-in-html-agility-pack-to-mak — Harikrishna, Mar 22 '10 at 12:12
@Joao Angelo, thanks for the help. For the missing closing tags I am currently using an HTML tidy pack, because HTML Agility Pack has no option for it. — Harikrishna, Mar 25 '10 at 11:25
@Joao Angelo, I have one major problem that I have been trying to solve for many days. Sometimes the table in the HTML page does not start with the column header I want; it starts with other information that I want to skip, but then I get this error: **Sum of the columns' FillWeight values cannot exceed 65535.** — Harikrishna, Mar 30 '10 at 10:44
@Joao Angelo, what if the table tag is innermost, like ` `? Then I want to extract the innermost table. — Harikrishna, Mar 31 '10 at 08:46

score 1 · Answer 2 · answered Mar 12 '10 at 10:32

Loop through the table rows and get the column values by index:

int index = 0;
foreach(HtmlNode tablerow in table.SelectNodes("tr"))
{
    // skip the first row...
    if(index > 0)
    {
        // select first td element
        HtmlNode td1 = tablerow.SelectSingleNode("td[1]");
        if(td1 != null)
        {
            string address = td1.InnerText;
        }
    }
    index++;
}

If you can modify the webpage, you could use thead for header texts and tbody for actual values.

<table id="mytable">
    <thead><tr><td>Column1</td><td>Column2</td></tr></thead>
    <tbody>
        <tr><td>Value 1</td><td>Value 2</td></tr>
        <tr><td>Value 1</td><td>Value 2</td></tr>
    </tbody>
</table>

Then you wouldn't have to skip the first row.

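// "//table" searches the whole document; a leading "/table" only matches a table at the document root.
// doc is the loaded HtmlDocument.
foreach(HtmlNode tablerow in doc.DocumentNode.SelectNodes("//table[@id='mytable']/tbody/tr"))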
{
    // ...
}

Have a look at an XPath tutorial; XPath is very useful with HtmlAgilityPack.
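
The column positions don't have to be hard-coded. As a rough sketch, the header row can be read first to map each header text to its position, and those positions then drive the index-based lookup for every data row. `HeaderLookup` and `Extract` are hypothetical names introduced here, and the code assumes the question's layout: header cells in the first `tr`, rows as direct children of `table`, and no `colspan`. It needs the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HeaderLookup
{
    // Finds the first table whose first row contains every wanted header and
    // returns, for each data row, the cells of those columns in the requested order.
    public static List<string[]> Extract(HtmlDocument doc, params string[] wanted)
    {
        var result = new List<string[]>();
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null) return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode table in tables)
        {
            var rows = table.SelectNodes("tr");
            if (rows == null || rows.Count == 0) continue;

            // Map each header text to its zero-based column position.
            var headerCells = rows[0].SelectNodes("td|th");
            if (headerCells == null) continue;
            var positions = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
            for (int i = 0; i < headerCells.Count; i++)
            {
                positions[headerCells[i].InnerText.Trim()] = i;
            }

            // Skip tables that lack any of the wanted headers.
            var indexes = new int[wanted.Length];
            bool hasAll = true;
            for (int i = 0; i < wanted.Length; i++)
            {
                if (!positions.TryGetValue(wanted[i], out indexes[i]))
                {
                    hasAll = false;
                    break;
                }
            }
            if (!hasAll) continue;

            // Read the wanted cells from every row after the header row.
            for (int r = 1; r < rows.Count; r++)
            {
                var cells = rows[r].SelectNodes("td");
                if (cells == null) continue;
                var picked = new string[wanted.Length];
                for (int i = 0; i < wanted.Length; i++)
                {
                    picked[i] = indexes[i] < cells.Count
                        ? cells[indexes[i]].InnerText.Trim()
                        : string.Empty;
                }
                result.Add(picked);
            }
            return result;
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<table border=1>
                <tr><td>sno</td><td>sname</td></tr>
                <tr><td>111</td><td>abcde</td></tr>
            </table>
            <table border=1>
                <tr><td>adress</td><td>phoneno</td><td>note</td></tr>
                <tr><td>asdlkj</td><td>121510</td><td>none</td></tr>
                <tr><td>asdlkj</td><td>214545</td><td>none</td></tr>
            </table>");

        foreach (string[] row in Extract(doc, "adress", "phoneno"))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

Rows with fewer cells than the header get an empty string for the missing cell instead of throwing. If the real pages wrap rows in `tbody` or nest tables, the `"tr"` selection would need adjusting.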
