
Context:

I'm developing a desktop application in C# to scrape and analyse product information from individual web pages across a small number of domains. I use HtmlAgilityPack to fetch and parse the pages and extract the data I need, with different parsing rules coded for each domain.

Issue:

Pages from one particular domain, when displayed in a browser, can show perhaps 60-80 products. However, when I parse them with HtmlAgilityPack I get at most 20 products. Looking at the raw HTML in Firefox's "View Page Source", there also appear to be only 20 of the relevant product divs present. I conclude that the remaining products must be loaded in by a script, perhaps to ease the load on the server. Indeed, I can sometimes see this happening in the browser: there is a short pause while 20 more products load, then another 20, and so on.

Question:

How can I access, through HtmlAgilityPack or otherwise, the full set of product divs present once all the scripting is complete?

ifinlay

2 Answers


You could use the WebBrowser control in System.Windows.Forms to load the data and the Agility Pack to parse it. It would look something like this:

    using System.IO;
    using System.Threading;
    using System.Windows.Forms;
    using mshtml;   // IHTMLDocument3 - add a reference to Microsoft.mshtml

    var browser = new WebBrowser();
    browser.Navigate("http://whatever.com");

    // Wait until the page (and its scripts) have finished loading.
    while (browser.ReadyState != WebBrowserReadyState.Complete || browser.IsBusy)
    {
        // Not for production - keep the message loop pumping so the
        // WebBrowser control can actually finish navigating.
        Application.DoEvents();
        Thread.Sleep(1000);
    }

    // Hand the fully rendered DOM over to HtmlAgilityPack.
    var doc = new HtmlAgilityPack.HtmlDocument();
    var dom = (IHTMLDocument3)browser.Document.DomDocument;
    var reader = new StringReader(dom.documentElement.outerHTML);
    doc.Load(reader);
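
For example, once the rendered DOM is in doc, the product divs can be queried with HtmlAgilityPack's XPath support. The snippet below is only a sketch: the class name and inner markup are placeholders for whatever the target site actually uses.

    // Placeholder XPath - substitute the real class/structure of the product divs.
    var productNodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'product')]");
    if (productNodes != null)   // SelectNodes returns null when nothing matches
    {
        foreach (var node in productNodes)
        {
            // Hypothetical markup: product name in an <h2> inside each div.
            var name = node.SelectSingleNode(".//h2")?.InnerText.Trim();
            Console.WriteLine(name);
        }
    }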

see here for more details

swestner
  • Thanks @swestner - looks promising. Unfortunately I'm building a WPF application, so System.Windows.Forms isn't immediately available to me, but I suspect there is a workaround for that which I'll look into. In the meantime I'm knocking something similar together using the Selenium package, which I've just discovered. I'll post the outcome of that below. – ifinlay Dec 10 '15 at 20:37

Ok, I've got something working using the Selenium package (available via NuGet). The code looks like this:

    private HtmlDocument FetchPageWithSelenium(string url)
    {
        IWebDriver driver = new FirefoxDriver();
        IJavaScriptExecutor js = (IJavaScriptExecutor)driver;

        driver.Navigate().GoToUrl(url);

        // Scroll to the bottom of the page and pause for more products to load.
        // Repeat four times as there may be 4 x 20 products to retrieve.
        for (int i = 0; i < 4; i++)
        {
            js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
            Thread.Sleep(2000);
        }

        // PageSource now contains the fully rendered markup.
        HtmlDocument webPage = new HtmlDocument();
        webPage.LoadHtml(driver.PageSource);

        driver.Quit();

        return webPage;
    }

This returns an HtmlAgilityPack HtmlDocument ready for further analysis, having first forced the page to load fully by repeatedly scrolling to the bottom. Two issues are outstanding:

  1. The code launches Firefox and then closes it again when complete. That's a bit clumsy and I'd rather it all happened invisibly. It's suggested that you can avoid this by using a PhantomJS driver instead of the Firefox driver, but that didn't help as it just pops up a Windows console window instead (a possible workaround is sketched after this list).
  2. It's a bit slow due to the time taken to load the browser and the pauses while the scripting loads the supplementary content. I can probably live with that, though replacing the fixed sleeps with an explicit wait (also sketched below) might tighten it up.
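
For what it's worth, here is a rough, untested sketch of possible mitigations for both points, using standard Selenium .NET APIs (PhantomJSDriverService from the PhantomJS driver package and WebDriverWait from OpenQA.Selenium.Support.UI). It is written as a drop-in for the start of FetchPageWithSelenium; the ".product" selector and the four-scroll limit are placeholders for the site's real markup and paging.

    // Issue 1: start PhantomJS without the pop-up console window by configuring
    // the driver service (HideCommandPromptWindow comes from Selenium's DriverService).
    var service = PhantomJSDriverService.CreateDefaultService();
    service.HideCommandPromptWindow = true;
    IWebDriver driver = new PhantomJSDriver(service);
    IJavaScriptExecutor js = (IJavaScriptExecutor)driver;

    driver.Navigate().GoToUrl(url);

    // Issue 2: instead of fixed sleeps, scroll and then wait until the number of
    // product divs grows (or the wait times out) before scrolling again.
    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
    for (int i = 0; i < 4; i++)
    {
        int before = driver.FindElements(By.CssSelector(".product")).Count;
        js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
        try
        {
            wait.Until(d => d.FindElements(By.CssSelector(".product")).Count > before);
        }
        catch (WebDriverTimeoutException)
        {
            break;   // no new products appeared - assume everything has loaded
        }
    }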

I'll try to rework the @swestner code as well to get it running in a WPF app and see which is the tidier solution.
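
For reference, one possible way to do that (untested here) is simply to reference the System.Windows.Forms and Microsoft.mshtml assemblies from the WPF project and run the WebBrowser control on its own STA thread with a message pump. The helper below is purely illustrative, and content that scripts add after DocumentCompleted fires would still need the kind of waiting discussed above.

    // Illustrative only: host the WinForms WebBrowser on a dedicated STA thread
    // so it can be used from a WPF application.
    private string FetchRenderedHtml(string url)
    {
        string html = null;
        var thread = new Thread(() =>
        {
            var browser = new System.Windows.Forms.WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (s, e) =>
            {
                // Read the live DOM, as in the answer above.
                var dom = (mshtml.IHTMLDocument3)browser.Document.DomDocument;
                html = dom.documentElement.outerHTML;
                System.Windows.Forms.Application.ExitThread();   // stop this thread's message loop
            };
            browser.Navigate(url);
            System.Windows.Forms.Application.Run();              // pump messages so the control works
        });
        thread.SetApartmentState(ApartmentState.STA);            // WebBrowser requires an STA thread
        thread.Start();
        thread.Join();
        return html;
    }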

ifinlay