
I am trying to scrape data from a news article using Html Agility Pack. The link is: http://www.ndtv.com/india-news/vyapam-scam-documents-show-chief-minister-shivraj-chouhan-delayed-probe-780528

I have written the code below to extract all the comments on this article, but for some reason my variable aTags comes back null.

Code:

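    // Requires: using HtmlAgilityPack;
    // Download and parse the page at the URL typed into the text box.
    var getHtmlWeb = new HtmlWeb();
    var document = getHtmlWeb.Load(txtinputurl.Text);
    // SelectNodes returns null when no node matches the XPath.
    var aTags = document.DocumentNode.SelectNodes("//div[@class='com_user_text']");
    int counter = 1;
    if (aTags != null)
    {
        foreach (var aTag in aTags)
        {
            // Prefix each comment with its number; += already appends to the existing text.
            lbloutput.Text += counter + ". " + aTag.InnerHtml + "\t" + "<br />";
            counter++;
        }
    }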

I have also tried this XPath, with the same result:

//div[@class='newcomment_list']/ul/li/div[@class='headerwrap']/div[@class='com_user_text']

Please help me with the correct XPath to extract all the comments. I have searched all over the net but found no solution.

user3818862

1 Answer


Do a 'View Source' on the page and search for com_user_text. The user comments don't appear at all: they are loaded via JavaScript after the page has loaded. So when you fetch the page content via getHtmlWeb.Load(), you get only the initial HTML, without the user comments. Your XPath then matches nothing, and SelectNodes returns null.
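To see this without touching the network, here is a minimal sketch that runs the question's XPath over two hand-written HTML strings. Both strings are assumptions made up for illustration (the real page's markup may differ): one mimics what the server might send before any script runs, the other mimics the DOM after a script has added a comment. CommentProbe and CountComments are hypothetical names.

```csharp
using System;
using HtmlAgilityPack;

public static class CommentProbe
{
    // Hypothetical markup: what the server might send before any script runs.
    public const string StaticHtml =
        "<html><body><div class='newcomment_list'><ul></ul></div>" +
        "<script src='comments.js'></script></body></html>";

    // Hypothetical markup: the same page after a script has injected one comment.
    public const string RenderedHtml =
        "<html><body><div class='newcomment_list'><ul><li>" +
        "<div class='headerwrap'><div class='com_user_text'>Nice article</div></div>" +
        "</li></ul></div></body></html>";

    // Counts the nodes matching the XPath from the question.
    // SelectNodes returns null (not an empty list) when nothing matches.
    public static int CountComments(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='com_user_text']");
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        Console.WriteLine("static:   " + CountComments(StaticHtml));
        Console.WriteLine("rendered: " + CountComments(RenderedHtml));
    }
}
```

The XPath itself is fine in this sketch; it only finds something once the comment markup is actually present in the HTML being parsed.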

As this answer says, Html Agility Pack is not a tool capable of emulating a browser and running JavaScript. Instead, you need something like WatiN that "allows programmatic access to web pages through a given browser engine and will load the full document."
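One way to act on this is to let a real browser render the page and then hand the rendered HTML to Html Agility Pack. The sketch below uses Selenium WebDriver rather than WatiN, purely as a stand-in browser-driving tool. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages and a local Chrome install. It also assumes the comments end up in the main document under the com_user_text class from the question; if the site renders them inside an iframe or under different markup, the wait will time out. RenderedScraper and ExtractComments are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class RenderedScraper
{
    // XPath taken from the question; whether it matches the live page is an assumption.
    private const string CommentXPath = "//div[@class='com_user_text']";

    // Pure parsing step: pulls the comment text out of already-rendered HTML.
    public static List<string> ExtractComments(string renderedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(renderedHtml);
        var nodes = doc.DocumentNode.SelectNodes(CommentXPath);
        var comments = new List<string>();
        if (nodes == null) return comments; // nothing matched
        foreach (var node in nodes)
            comments.Add(node.InnerText.Trim());
        return comments;
    }

    public static void Main(string[] args)
    {
        string url = args[0]; // article URL passed on the command line
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(url);

            // Wait up to 15 seconds for a script to add at least one comment node;
            // Until throws WebDriverTimeoutException if none appears in time.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));
            wait.Until(d => d.FindElements(By.XPath(CommentXPath)).Count > 0);

            // With ChromeDriver, PageSource is expected to reflect the current DOM.
            int counter = 1;
            foreach (var comment in ExtractComments(driver.PageSource))
                Console.WriteLine(counter++ + ". " + comment);
        }
    }
}
```

Keeping the parsing in ExtractComments separate from the browser step means the existing Html Agility Pack code barely changes; only the source of the HTML does.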

LarsH
  • P.S. Welcome to Stack Overflow. You seem to have posted this question twice... was that intentional? – LarsH Jul 14 '15 at 18:42
  • Sir, do you mean I won't be able to complete the task using Html Agility Pack? And you advise me to use WatiN; is this similar to HAP? – user3818862 Jul 14 '15 at 19:04
  • @user3818862: It is similar, except that instead of simply parsing HTML into a tree structure and letting you select nodes from it (as HAP does), Watir/WatiN drives an actual web browser, which does much more, including running JavaScript, so that you can test dynamic pages. – LarsH Jul 14 '15 at 19:14
  • Hi @LarsH, I just wanted to know: is it possible to scrape content from the web page after the JavaScript has loaded completely, using Html Agility Pack? – user3818862 Jul 15 '15 at 17:22
  • @user3818862: No, I don't think so. HAP merely fetches the HTML, parses it into a tree structure, and lets you select nodes from it, e.g. using XPath. It does not run JavaScript code. See http://stackoverflow.com/a/11394830/423105 for more info. – LarsH Jul 15 '15 at 18:27