WebClient client = new WebClient();
var data = client.DownloadString("a web link");

This returns an HTML page that contains a table like this:

<table>
<tr>
   <td> Team 1 ID </td>
   <td> Team 1 Name </td>
   <td>
       <table>
        <tr>
           <td> Member 1 name </td>
           <td> Member 1 age </td>
        </tr>
        <tr>
           <td> Member 2 name </td>
           <td> Member 2 age </td>
        </tr>
        </table>
    </td>
</tr>
<tr>
   <td> Team 2 ID </td>
   <td> Team 2 Name </td>
   <td>
       <table>
        <tr>
           <td> Member 1 name </td>
           <td> Member 1 age </td>
        </tr>
        </table>
    </td>
</tr>
</table>

Each row of the main table contains another table, which is why I call it a nested table. I want to load this data into a class like this:

class Team
{

    public int teamID;
    public string teamName;
    public struct Member
    {
        public string memberName;
        public int memberAge;
    }

    public Member member1;
    public Member member2;
}

Note that each team might have 0 to 3 members.

I am looking for a sound way to solve this. Should I use RegEx, HtmlAgilityPack, or something else, and how? Thanks in advance.

Raihan Al-Mamun

1 Answer


Just use HtmlAgilityPack. If you run into any trouble, I can help you.

Regular expressions can only match regular languages, but HTML is a context-free language. All you can do with regular expressions on HTML is apply heuristics, and those will not hold for every input: for any given regular expression, it should be possible to construct an HTML file that it matches incorrectly.
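To make the nesting problem concrete, here is a minimal sketch of how a naive pattern goes wrong on a table inside a table. The pattern and the class and method names (`RegexNestingDemo`, `FirstTableByRegex`) are made up for illustration; this is only one example of a heuristic failing, not a proof about every possible pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexNestingDemo
{
    // Returns whatever a naive non-greedy pattern captures as "the first table".
    public static string FirstTableByRegex(string html)
    {
        Match m = Regex.Match(html, "<table>.*?</table>", RegexOptions.Singleline);
        return m.Value;
    }

    public static void Main()
    {
        string html = "<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>";
        string captured = FirstTableByRegex(html);

        // The match starts at the outer <table> but stops at the first </table>,
        // which belongs to the inner table, so the fragment is not a complete table.
        Console.WriteLine(captured);
        Console.WriteLine(captured == html);
    }
}
```

Making the pattern greedy fixes this particular input, but it then swallows everything between the first `<table>` and the last `</table>` on a page with two sibling tables. A parser that builds a tree avoids both problems.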

Using regular expressions to parse HTML: why not?

It will be easier if your HTML contains identifiers (CSS classes or ids).

Updated code: here is how I suggest approaching your problem.

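        // Needs "using System.Linq;" and a reference to the HtmlAgilityPack package.
        string mainURL = "your url";
        HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load(mainURL);

        // All tables that contain another table somewhere inside them.
        var tables = doc.DocumentNode.Descendants("table").Where(t => t.Descendants("table").Any());
        foreach (var table in tables)
        {
            // Direct tr children only (not grandchildren).
            var rows = table.Elements("tr");
            foreach (var row in rows)
            {
                // Elements("td") returns only the td children and skips the whitespace
                // text nodes that ChildNodes also contains, so i counts cells: 0, 1, 2...
                var cells = row.Elements("td").ToList();
                for (int i = 0; i < cells.Count; i++)
                {
                    // The nested table sits inside a td, so look for it there
                    // rather than among the children of the row.
                    var nestedTable = cells[i].Element("table");
                    if (nestedTable == null)
                    {
                        // Put your logic here, e.g. for i == 0 assign cells[i].InnerText.Trim() to the team ID, etc.
                    }
                    else
                    {
                        // Here is your logic to handle the nested table, e.g. loop over nestedTable.Elements("tr").
                    }
                }
            }
        }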
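For the structure in the question, the placeholder comments could be filled in roughly as follows. This is a sketch under several assumptions, not a tested solution for the real page: the first two cells are taken to hold the team ID and name and the third the nested table, rows are direct children of their table (no `tbody` in the markup), there is only one outer table, and values that do not parse as numbers fall back to 0. The `Team` and `Member` classes here replace the fixed `member1`/`member2` fields with a list so that 0 to 3 members fit naturally; `TeamParser`, `Clean` and `ToInt` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public class Member
{
    public string Name;
    public int Age;
}

public class Team
{
    public int Id;
    public string Name;
    public List<Member> Members = new List<Member>();
}

public static class TeamParser
{
    public static List<Team> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var teams = new List<Team>();

        // Assumption: the first table that contains another table is the team table.
        var outer = doc.DocumentNode.Descendants("table")
                       .FirstOrDefault(t => t.Descendants("table").Any());
        if (outer == null) return teams;

        foreach (var row in outer.Elements("tr"))
        {
            var cells = row.Elements("td").ToList();
            if (cells.Count < 2) continue; // not a team row

            var team = new Team { Id = ToInt(cells[0]), Name = Clean(cells[1]) };

            // A team without members may have no third cell or no nested table.
            var nested = cells.Count > 2 ? cells[2].Element("table") : null;
            if (nested != null)
            {
                foreach (var memberRow in nested.Elements("tr"))
                {
                    var m = memberRow.Elements("td").ToList();
                    if (m.Count < 2) continue;
                    team.Members.Add(new Member { Name = Clean(m[0]), Age = ToInt(m[1]) });
                }
            }
            teams.Add(team);
        }
        return teams;
    }

    // Decodes entities such as &amp; and strips surrounding whitespace and line breaks.
    static string Clean(HtmlNode node)
    {
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    // Assumption: unparsable numbers become 0 instead of throwing.
    static int ToInt(HtmlNode node)
    {
        int value;
        return int.TryParse(Clean(node), out value) ? value : 0;
    }
}
```

With the string from `WebClient.DownloadString`, usage would look like `List<Team> teams = TeamParser.Parse(data);`. If the real markup wraps rows in `tbody`, the two `Elements("tr")` calls would need to look inside it as well.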
Hung Cao
  • Every table has the same class, so I can't tell them apart. I will keep trying, but it would be great if you could give me some code for this problem. – Raihan Al-Mamun Apr 12 '17 at 19:43
  • Just provide your HTML with all available attributes and I can point you in the right direction. – Hung Cao Apr 12 '17 at 19:48
  • Class ID Status Time 00001 Open 8:00 AM 9:30 AM 8:00 AM 9:30 AM – Raihan Al-Mamun Apr 12 '17 at 19:59
  • This is the actual table structure. The original table has 1600+ rows and another table in each row! Because of an authentication issue, I can't give you the page link. – Raihan Al-Mamun Apr 12 '17 at 20:00
  • Sorry to say, but this code is not working. It does go through all the nodes (I checked with a counter), but I am not getting the proper data in my class. – Raihan Al-Mamun Apr 13 '17 at 10:51
  • The problem might be somewhere else: I am using webclient.DownloadString("mylink"), and I found that the result also contains \r, \n and lots of spaces, so I am getting null values in the class properties. What can I do now? Please help, I think we are close. – Raihan Al-Mamun Apr 13 '17 at 10:53
  • Update: if I use i == 1, i == 3 (that is, odd values of i), I get the right data! – Raihan Al-Mamun Apr 13 '17 at 11:14
  • You can use `node.GetAttributeValue("class","")` to get the value of the class attribute; it returns an empty string if there is no class attribute. – Hung Cao Apr 13 '17 at 17:19
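The odd-index observation in the comments is consistent with how HtmlAgilityPack builds the tree: the line breaks and spaces between tags become text nodes, so `ChildNodes` of a formatted row alternates between whitespace and `td`. A small sketch of that effect follows; the sample row and the names `ChildNodesDemo` and `ChildNames` are invented for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ChildNodesDemo
{
    // Lists the node names of every direct child of the first row.
    public static string[] ChildNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var row = doc.DocumentNode.Descendants("tr").First();
        return row.ChildNodes.Select(n => n.Name).ToArray();
    }

    public static void Main()
    {
        string html = "<table><tr>\n  <td class=\"id\"> 7 </td>\n  <td> Alpha </td>\n</tr></table>";

        // Whitespace between the tags shows up as "#text" entries around each td.
        Console.WriteLine(string.Join(", ", ChildNames(html)));

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var row = doc.DocumentNode.Descendants("tr").First();

        // Elements("td") skips the text nodes; Trim() removes \r, \n and spaces.
        foreach (var cell in row.Elements("td"))
        {
            string cssClass = cell.GetAttributeValue("class", "");
            Console.WriteLine("[" + cell.InnerText.Trim() + "] class='" + cssClass + "'");
        }
    }
}
```

Iterating `Elements("td")` and trimming `InnerText` removes the need to rely on odd indexes, which would break as soon as the page's whitespace changes.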