How to get a fully loaded HTML page's code

Question

I want to parse a site's pages programmatically, so I need the full HTML code. However, a site may serve only a master page at the direct URL and, once that master page has loaded, fetch its actual content via AJAX.

How can I load a page "like in a browser", so that it loads completely and all its scripts have done their work?

For example, if I use the WebBrowser class to really open a page "like in a browser", its DocumentText property (which should represent the DOM contents) returns only the initial page, without the actual content loaded via AJAX or similar (tested on google.com). The same happens in browsers too, where I need to use the developer tools to see the actual HTML.

UPDATED: the answer turned out to be here, thanks to Vladimir Shmidt: "how to dynamically generate HTML code using .NET's WebBrowser or mshtml.HTMLDocument?"

DocumentText isn't updated after the "root" DOM has loaded, but the Document property is.

score 0 · Accepted Answer · answered Sep 18 '14 at 17:21


Have you heard about http://webkitdotnet.sourceforge.net/? Moreover, .NET has a WebBrowser component that can be used for

answered Sep 18 '14 at 17:21

Vladimir Shmidt


Yes, it has, I've updated my question just after your comment, please look into it. – yaapelsinko Sep 18 '14 at 17:24
Would the DocumentCompleted (WebBrowserDocumentCompletedEventHandler) event in WebBrowser be enough to detect the point when the WHOLE site has loaded, even via AJAX? – Vladimir Shmidt Sep 18 '14 at 17:27
Hummm, I'll go look into it... – yaapelsinko Sep 18 '14 at 17:30
Well, after some googling I've found http://stackoverflow.com/questions/20930414/how-to-dynamically-generate-html-code-using-nets-webbrowser-or-mshtml-htmldocu/20934538#20934538, so your question can be marked as a duplicate. – Vladimir Shmidt Sep 18 '14 at 17:31
Yeah, great. Thank you, it's always about asking Google the right question. – yaapelsinko Sep 18 '14 at 17:39

score 0 · Answer 2 · answered Sep 18 '14 at 17:22


How could I load a page "like in a browser" ... ?

The only sure way to do this is to actually load the page in a browser. This can be automated by using a tool like Selenium/WebDriver.
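As a rough illustration of that suggestion, here is a minimal Python sketch built on Selenium's WebDriver bindings. The helper names `wait_for_rendered_html` and `fetch_with_chrome`, the CSS-selector readiness check, and the timeout values are assumptions made for this example and are not part of the original answer. It also assumes the `selenium` package and a Chrome driver are available.

```python
import time


def wait_for_rendered_html(driver, url, is_ready, timeout=10.0, poll_interval=0.25):
    """Load `url`, poll until `is_ready(driver)` is true, then return the page HTML.

    `driver` is any object exposing Selenium's WebDriver interface
    (`get` and `page_source` are the only members used here).
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        if is_ready(driver):
            # page_source is a serialization of the current DOM,
            # so it includes nodes that scripts inserted after the initial load.
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{url} was not ready after {timeout} seconds")
        time.sleep(poll_interval)


def fetch_with_chrome(url, css_selector):
    """Hypothetical wrapper: wait until `css_selector` matches at least one element."""
    # Imported lazily so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        return wait_for_rendered_html(
            driver,
            url,
            lambda d: len(d.find_elements(By.CSS_SELECTOR, css_selector)) > 0,
        )
    finally:
        driver.quit()
```

The readiness check is passed in as a function because only the caller knows which element signals that the AJAX content has arrived. A call such as `fetch_with_chrome(some_url, "#content")` would return the HTML once an element with that (hypothetical) id exists.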

answered Sep 18 '14 at 17:22

StriplingWarrior


Well, there is the WebBrowser class to open it in a browser; I've updated my question about it. I still don't see an "elegant" solution... – yaapelsinko Sep 18 '14 at 17:28

score 0 · Answer 3 · answered Sep 18 '14 at 17:26


From the title, it seems that you want the completed HTML in your page after AJAX and JavaScript have retrieved or generated content. If this is the case, the browser's debugger (F12) will have this. In Chrome, look under the "Elements" tab.

answered Sep 18 '14 at 17:26

Chris Barlow


Yes, exactly, but I need this to be loaded programmatically so I can parse it. Is there any chance of getting the completed code from an instance of WebBrowser? – yaapelsinko Sep 18 '14 at 17:29

score 0 · Answer 4 · edited May 23 '17 at 12:18


There are a few solutions out there.

Main Logic:

1. Request the page
2. Wait until the document is fully loaded (ReadyState = Complete)
3. Get the document content

I guess one of the simplest is to use a browser control such as WebBrowser: navigate to your URL and wait for the control's ready or complete state. After that you can start parsing.

There is a solution here on SO: "htmlagilitypack and dynamic content issue"
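The request, wait, and read steps can be sketched roughly as follows. This is only an illustration in Python against a Selenium-style driver object, which is an assumption (the answer itself talks about a .NET control). Since a "complete" ready state does not guarantee that later AJAX calls have finished, the sketch additionally waits until the HTML stops changing between polls. That stability check is a heuristic added for this example, and the name `get_settled_html` and all default values are hypothetical.

```python
import time


def get_settled_html(driver, url, stable_polls=3, poll_interval=0.5, timeout=30.0):
    """Return the page HTML once the document is complete and the DOM has settled.

    "Settled" here means the serialized DOM was identical to the previous
    snapshot for `stable_polls` consecutive polls (a heuristic, not a guarantee).
    """
    driver.get(url)  # step 1: request the page
    deadline = time.monotonic() + timeout
    previous = None
    unchanged = 0
    while time.monotonic() < deadline:
        # step 2: wait for the ready state, then for the DOM to stop changing
        state = driver.execute_script("return document.readyState")
        html = driver.page_source
        if state == "complete" and html == previous:
            unchanged += 1
            if unchanged >= stable_polls:
                return html  # step 3: hand back the document content
        else:
            unchanged = 0
        previous = html
        time.sleep(poll_interval)
    raise TimeoutError(f"{url} did not settle within {timeout} seconds")
```

Pages that update continuously (tickers, animations that rewrite the DOM) may never settle under this check, so waiting for one specific element is often the more reliable condition when such an element is known.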

edited May 23 '17 at 12:18

Community


answered Sep 18 '14 at 17:42

Calvijn

