2

I want to parse a site's pages programmatically, so I need the full HTML code. However, a site may serve only a master page at the direct URL; once the master page is loaded, it fetches its actual content via AJAX.

How could I load a page "like in a browser" to let it be loaded completely with all its scripts having their work done?

For example, if I use the WebBrowser class to really open a page "like in a browser", its DocumentText property (which should represent the DOM contents) only returns the initial page, without the content loaded via AJAX or similar (tested on google.com). The same happens in browsers: "view source" shows only the initial page, and to see the actual HTML I need to use the developer tools.

UPDATE: the answer was found here, thanks to Vladimir Shmidt: "how to dynamically generate HTML code using .NET's WebBrowser or mshtml.HTMLDocument?"

DocumentText doesn't update its contents after the "root" DOM has loaded, but the Document property does.

Community
yaapelsinko

4 Answers

0

Have you heard of http://webkitdotnet.sourceforge.net/? Moreover, .NET has a WebBrowser component that can be used for

Vladimir Shmidt
0

How could I load a page "like in a browser" ... ?

The only sure way to do this is to actually load the page in a browser. This can be automated by using a tool like Selenium/WebDriver.
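As a rough illustration of the Selenium/WebDriver route, here is a minimal C# sketch. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages plus a local Chrome installation; the URL, the headless flag and the 15-second timeout are placeholders of mine, not anything from the question.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

class RenderedPageDemo
{
    static void Main()
    {
        // Hypothetical URL; replace with the page you want to parse.
        const string url = "https://example.com/";

        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // Assumption: wait until the browser reports the document as complete.
            // This does not prove that every AJAX call has finished.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));
            wait.Until(d => (string)((IJavaScriptExecutor)d)
                .ExecuteScript("return document.readyState") == "complete");

            // Ask the browser for the markup of the page as it currently stands.
            string html = driver.PageSource;
            Console.WriteLine(html.Length);
        }
    }
}
```

If the content arrives later via AJAX, waiting for a specific element to appear (for example with `driver.FindElements(By.CssSelector(...)).Count > 0` inside `wait.Until`) is usually more reliable than checking `document.readyState` alone.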

StriplingWarrior
  • Well, there is the WebBrowser class to open it in a browser; I've updated my question about it. I still don't see an "elegant" solution... – yaapelsinko Sep 18 '14 at 17:28
0

From the title, it seems that you want the completed HTML of your page after AJAX and JavaScript have retrieved or generated the content. If this is the case, the browser's debugger (F12) will have this. In Chrome, look under the "Elements" tab.

Chris Barlow
  • Yes, exactly, but I need this to be loaded programmatically so I can parse it. Any chance of getting the completed code from an instance of WebBrowser? – yaapelsinko Sep 18 '14 at 17:29
0

There are a few solutions out there.

Main Logic:

  1. Request the Page
  2. Wait until the document is fully loaded (ReadyState = Complete)
  3. Get Document content

I guess one of the simplest is to use a WebBrowser control: navigate to your URL and wait for the control's ready state to become complete. After that you can start parsing.
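The three steps above could look roughly like this with the WinForms WebBrowser control. This is only a sketch under some assumptions: the `RenderedHtml.Load` helper and the fixed 2-second grace period for AJAX calls are my own inventions, and the code reads the live `Document` rather than `DocumentText`, as the question's update suggests.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class RenderedHtml
{
    // Hypothetical helper: loads url in a WebBrowser and returns the live DOM as HTML.
    public static string Load(string url)
    {
        string html = null;

        // WebBrowser needs an STA thread with a running message loop.
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                bool timerStarted = false;

                browser.DocumentCompleted += (s, e) =>
                {
                    // Step 2: DocumentCompleted can fire once per frame,
                    // so wait until the control itself reports Complete.
                    if (browser.ReadyState != WebBrowserReadyState.Complete || timerStarted)
                        return;
                    timerStarted = true;

                    // Assumption: give scripts 2 seconds to finish their AJAX calls.
                    var timer = new System.Windows.Forms.Timer { Interval = 2000 };
                    timer.Tick += (ts, te) =>
                    {
                        timer.Stop();
                        timer.Dispose();

                        // Step 3: serialize the current DOM, not the original DocumentText.
                        var roots = browser.Document?.GetElementsByTagName("html");
                        if (roots != null && roots.Count > 0)
                            html = roots[0].OuterHtml;

                        Application.ExitThread(); // end the message loop below
                    };
                    timer.Start();
                };

                browser.Navigate(url); // Step 1: request the page
                Application.Run();     // pump messages until ExitThread is called
            }
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return html;
    }
}
```

A call such as `string html = RenderedHtml.Load("https://example.com/");` (placeholder URL) would then return the markup after scripts have run, or `null` if no document was available. The fixed delay is the weakest part of the sketch; polling the DOM for an element you expect to appear would be a more robust stop condition.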

There is a solution here on SO: "htmlagilitypack and dynamic content issue"

Community
Calvijn