C# extract content of a div from a string

Question

I need to extract the tr and td content of the 2nd table inside the div from an external URL. I can't use HtmlAgilityPack.

The markup looks something like this:

<div class="class1" id="content-main">
      <table width="90%">
        <tbody>
          <tr><td class="table_left_corner">&nbsp;</td><td class="table_head">table1 </td><td class="table_right_corner">&nbsp;</td></tr>
        </tbody>
      </table>
      <table width="90%">
        <tbody>
          <tr><td class="table_left_corner">&nbsp;</td><td class="table_head">table2</td><td class="table_right_corner">&nbsp;</td></tr>
        </tbody>
      </table>
      <table width="90%">
        <tbody>
          <tr><td class="table_left_corner">&nbsp;</td><td class="table_head">table3 </td><td class="table_right_corner">&nbsp;</td></tr>
        </tbody>
      </table>
</div>

So I want to use some Regex functions to return the content of a table.

 using (WebClient client = new WebClient())
 {
    string htmlcode= client.DownloadString("http://www.example.com");

    string r = @"<div.*?id=""content-main"".*?>.*</div>";       

    Match match2 = Regex.Match(htmlcode, r);

    string a = match2.Groups[1].Value;
 }

I have tried different regular expressions, but all of them failed. Please help: how can I get the content of the 2nd table?

Edit 2: Using HtmlAgilityPack

    var web = new HtmlWeb();
    var document = web.Load("http://www.example.com/");
    var page = document.DocumentNode;


    string outerHTML = page.SelectNodes("//table")[5].OuterHtml;
    Match match1;
    match1 = Regex.Match(outerHTML, @"<a [^>]+>(.*?)<\/a>");

    while (match1.Success)
    {
        string NAme = match1.Groups[1].Value;

        var webloc = new HtmlWeb();
        dynamic documentloc = null;
        documentloc = webloc.Load(urlAddress + NAme.Replace(" ", "-").ToLower());
        dynamic pageloc = documentloc.DocumentNode;

        string outerHTMLloc = pageloc.SelectNodes("//table")[5].OuterHtml;

        match1 = match1.NextMatch();
    }

The first iteration runs successfully, but on the second iteration it throws an error at "outerHTMLloc".

Error: "An unhandled exception of type 'System.StackOverflowException' occurred in HtmlAgilityPack.DLL"

Any reason you don't want to use an HTML parser such as Html Agility Pack or some other alternatives? Were you aware of this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 That's what awaits you if you attempt to parse HTML with a regular expression. The sooner you quit doing it, the sooner you will feel the relief. — Darin Dimitrov, Dec 29 '13 at 17:28
@DarinDimitrov: Thanks for the link, but do you have any alternative idea? – ankit Gupta, Dec 29 '13 at 17:35
No, I have no alternate ideas. I really don't want to contribute to StackOverflow with wrong code. Can you imagine someone having the same problem as you and stumble upon this thread? Nobody wants the poor person to make such mistakes like parsing HTML with regular expressions. — Darin Dimitrov, Dec 29 '13 at 17:38
Using regex won't give you the performance gain you might expect, and it **won't** work for arbitrarily nested tags. I would argue that HtmlAgilityPack is faster and more efficient than regex, especially when there are a lot of nested tags. – Anirudha, Dec 29 '13 at 17:38
@Anirudh: Performance is not an issue for me because this is for internal use only. I have to create a crawler that covers lots of URLs, and HtmlAgilityPack is failed in looping. – ankit Gupta, Dec 29 '13 at 17:40
`HtmlAgilityPack is failed in looping`??? Has it occurred to you that maybe you are using it incorrectly? If you showed your code here (the one with HTML Agility Pack) we might be able to see what errors you might have made and why it is not working. But using regular expressions to parse HTML because *HTML Agility Pack is failed in looping* is the most stupid explanation I've ever heard. — Darin Dimitrov, Dec 29 '13 at 17:43
@DarinDimitrov: Please check the code now and tell me what is wrong with it. – ankit Gupta, Dec 29 '13 at 18:04

Anirudha · Answer 1 · 2013-12-29T18:03:34.170

Assuming you want the content of the second table inside the div with id 'content-main', your code should be:

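// Requires "using System.Linq;". pageloc must be typed as HtmlNode (not dynamic),
// because extension methods such as Skip cannot be called on a dynamic value.
string value = pageloc.SelectNodes("//div[@id='content-main']//table")   // select all table tags inside the div
                      .Skip(1)    // skip the first table
                      .First()    // take the second table
                      .InnerHtml;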
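
For anyone who wants to try this in isolation, here is a minimal sketch of the same idea. It assumes the HtmlAgilityPack NuGet package is referenced and that the HTML is already available as a string. The class and method names (TableExtractor, SecondTableInnerHtml) are made up for illustration and do not come from the library. It indexes the node collection directly instead of using LINQ, and it guards against SelectNodes returning null when nothing matches.

```csharp
using HtmlAgilityPack; // third-party NuGet package, the same one used in the question's Edit 2

// Hypothetical helper class, named here only for illustration.
public static class TableExtractor
{
    // Returns the inner HTML of the second <table> inside <div id="content-main">,
    // or null if the div or a second table is not present.
    public static string SecondTableInnerHtml(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection tables =
            document.DocumentNode.SelectNodes("//div[@id='content-main']//table");

        if (tables == null || tables.Count < 2)
        {
            return null;
        }

        return tables[1].InnerHtml; // index 1 is the second table
    }
}
```

Calling TableExtractor.SecondTableInnerHtml(htmlcode) with the string downloaded in the question should return the markup of the table2 block, provided the page really contains a div with that id.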

edited Dec 29 '13 at 18:03

answered Dec 29 '13 at 17:57

The problem is in the while loop: the first iteration runs successfully, but when it reaches the second iteration and the URL changes, it throws the error. – ankit Gupta Dec 29 '13 at 18:08
@ankitGupta Why are you using a while loop? The code above would do it without a while loop, assuming I understood your question. – Anirudha Dec 29 '13 at 18:10
The while loop is used for changing the page URL. – ankit Gupta Dec 29 '13 at 18:12
