0

I have a table like this:

<table border="0" cellpadding="0" cellspacing="0" id="table2">
    <tr>
        <th>Name
        </th>
        <th>Age
        </th>
    </tr>
        <tr>
        <td>Mario
        </td>
        <th>Age: 78
        </td>
    </tr>
            <tr>
        <td>Jane
        </td>
        <td>Age: 67
        </td>
    </tr>
            <tr>
        <td>James
        </td>
        <th>Age: 92
        </td>
    </tr>
</table>

and I am using the HTML Agility Pack to parse it. I have tried the code below, but it is not returning the expected results:

foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//table[@id='table2']//tr"))
{
    // looping on each row, get col1 and col2 of each row
    HtmlNodeCollection tds = tr.SelectNodes("td");
    for (int i = 0; i < tds.Count; i++)
    {
        Response.Write(tds[i].InnerText);
    }
}

I am getting each column because I would like to do some processing on the contents returned.

What am I doing wrong?

mpora

2 Answers

1

You can grab the cell content from within your outer foreach loop:

foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//table[@id='table2']//tr//td"))  
{  
    Response.Write(td.InnerText);   
}  

I'd also recommend trimming and de-entitizing the inner text to make sure it is clean:

Response.Write(HtmlEntity.DeEntitize(td.InnerText).Trim());

In your source, the cells for [Age: 78] and [Age: 92] open with a <th> tag instead of <td>, so a query for td does not match them.
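
The original loop may also fail for a separate reason: SelectNodes returns null, not an empty collection, when nothing matches, so the header row (which has only <th> cells) would make tds.Count throw. A minimal sketch of a null-safe version that accepts both cell types follows. The TableReader class and ReadCells method are hypothetical names, and the HTML is passed in as a string only to keep the example short:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableReader
{
    // Returns the trimmed, entity-decoded text of every cell in table2,
    // row by row, treating <th> and <td> cells alike.
    public static List<string> ReadCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table[@id='table2']//tr");
        if (rows == null) return cells;

        foreach (HtmlNode tr in rows)
        {
            // "th|td" is relative to the current row and matches either cell type.
            HtmlNodeCollection tds = tr.SelectNodes("th|td");
            if (tds == null) continue; // row without any cells

            foreach (HtmlNode cell in tds)
            {
                cells.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
            }
        }
        return cells;
    }
}
```

If only the data cells are wanted, the inner query can go back to "td" while keeping the null check.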

0

This is my solution. Note that your HTML is not well formed, because you have <th> where <td> should be:

<table border="0" cellpadding="0" cellspacing="0" id="table2">
    <tr>
        <th>Name
        </th>
        <th>Age
        </th>
    </tr>
        <tr>
        <td>Mario
        </td>
        <td>Age: 78
        </td>
    </tr>
            <tr>
        <td>Jane
        </td>
        <td>Age: 67
        </td>
    </tr>
            <tr>
        <td>James
        </td>
        <td>Age: 92
        </td>
    </tr>
</table>

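And this is the C# code:

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the corrected HTML from disk.
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.Load("page.html");

            // Elements("tr") returns only the rows that are direct children of the table.
            List<HtmlNode> x = document.GetElementbyId("table2").Elements("tr").ToList();

            foreach (HtmlNode node in x)
            {
                // The header row has no <td> cells, so it yields an empty list and is skipped.
                List<HtmlNode> s = node.Elements("td").ToList();
                foreach (HtmlNode item in s)
                {
                    Console.WriteLine("TD Value: " + item.InnerText);
                }
            }
            Console.ReadLine();
        }
    }
}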


Edit: I should add that if you are going to use <th> tags, it is good practice to put the header row inside a <thead> tag and the data rows inside a <tbody> tag, so that the structure of the table is explicit :)

More info: http://www.w3schools.com/tags/tag_thead.asp
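
One thing to watch if the rows are moved into <thead> and <tbody>: Elements("tr") only looks at direct children of the table, so the code above would then find no rows. A minimal sketch, assuming the same table2 id, that uses Descendants so it works with or without the wrappers (the RowReader class and ReadDataRows method are hypothetical names):

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowReader
{
    // Returns one string array per data row (rows that contain <td> cells).
    public static List<string[]> ReadDataRows(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new List<string[]>();
        HtmlNode table = document.GetElementbyId("table2");
        if (table == null) return result; // no element with that id

        // Descendants finds <tr> at any depth, e.g. inside <tbody>.
        foreach (HtmlNode row in table.Descendants("tr"))
        {
            string[] cells = row.Elements("td")
                                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                                .ToArray();
            if (cells.Length > 0) result.Add(cells); // skips header-only rows
        }
        return result;
    }
}
```

Note that Descendants would also pick up rows of any table nested inside table2, which may or may not be what you want.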

Hanlet Escaño
  • I solved it before coming back. I am now applying regular expressions to extract the age number and creating a CSV file that will hold the name and age (i.e., name,age). – mpora Feb 20 '13 at 23:55
  • Thank you. HTML agility pack has sped my progress. – mpora Feb 21 '13 at 00:36
  • FYI, using regular expressions to parse HTML is usually a bad idea: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – Andriy F. Feb 21 '13 at 09:10
  • I am in a .NET shop, what do you suggest? The article you linked to suggests an alternative, but it is nowhere to be found. – mpora Feb 21 '13 at 16:08