1

If I retrieve a website with this call, I get the whole page, but without the AJAX-loaded values.

    htmlDoc.LoadHtml(new WebClient().DownloadString(url));

Is it possible to load the website the way Google Chrome does, with all values populated?

Philipp Nies

2 Answers

3

You can use a WebBrowser control to load and render the page. Unfortunately, the control is built on Internet Explorer, and you have to change a registry value (the FEATURE_BROWSER_EMULATION key) to force it to use the latest installed version; even then, the implementation is very brittle.

Another option is to take a standalone browser engine like WebKit and make it work in .NET. I found a page explaining how to do this, but it's pretty dated: http://webkitdotnet.sourceforge.net/basics.php

I worked on a little demo app to get the content and this is what I came up with:

    using System;
    using System.IO;
    using System.Text;
    using NReco.PhantomJS;

    class Program
    {
        static void Main(string[] args)
        {
            GetRenderedWebPage("https://siderite.dev", TimeSpan.FromSeconds(5), output =>
            {
                Console.Write(output);
                File.WriteAllText("output.txt", output);
            });
            Console.ReadKey();
        }

        private static void GetRenderedWebPage(string url, TimeSpan waitAfterPageLoad, Action<string> callBack)
        {
            const string cEndLine = "All output received";

            var sb = new StringBuilder();
            var p = new PhantomJS();
            p.OutputReceived += (sender, e) =>
            {
                if (e.Data == cEndLine)
                {
                    callBack(sb.ToString());
                }
                else
                {
                    sb.AppendLine(e.Data);
                }
            };
            p.RunScript(@"
var page = require('webpage').create();
page.viewportSize = { width: 1920, height: 1080 };
page.onLoadFinished = function(status) {
    if (status=='success') {
        setTimeout(function() {
            console.log(page.content);
            console.log('" + cEndLine + @"');
            phantom.exit();
        }," + (long)waitAfterPageLoad.TotalMilliseconds + @");
    }
};
var url = '" + url + @"';
page.open(url);", new string[0]);
        }
    }

This uses the PhantomJS "headless" browser by way of the NReco.PhantomJS wrapper, which you can install as a NuGet package directly from Visual Studio. I am sure it can be done better, but this is what I did today. You might want to take a look at the PhantomJS callbacks so you can properly debug what is going on. For example, my code will wait forever if the URL cannot be loaded. Here is a useful link: https://newspaint.wordpress.com/2013/04/25/getting-to-the-bottom-of-why-a-phantomjs-page-load-fails/
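As a rough illustration of the "wait forever" problem, here is a minimal sketch of how the PhantomJS side could report a failed load instead of staying silent. The function name `attachFailureHandling` and its parameters are hypothetical and not part of PhantomJS or NReco.PhantomJS; only `page.onResourceError`, `page.onLoadFinished`, `page.content` and `phantom.exit` are PhantomJS features. The page and the `phantom` object are passed in as arguments so the logic can also be exercised outside PhantomJS.

```javascript
// Sketch only: `page` is assumed to be a PhantomJS WebPage object and
// `phantomApi` the global `phantom` object. `log` is any function that
// writes one line of output (hypothetical parameter).
function attachFailureHandling(page, phantomApi, endLine, waitMs, log) {
    var lastError = null;

    // PhantomJS calls this for every resource that fails to load;
    // remember the most recent one for the failure report.
    page.onResourceError = function (resourceError) {
        lastError = resourceError.errorString + ' (' + resourceError.url + ')';
    };

    page.onLoadFinished = function (status) {
        if (status !== 'success') {
            // Report the failure and still emit the end marker, so the
            // caller waiting for that marker is not left hanging.
            log('LOAD FAILED: ' + (lastError || 'unknown error'));
            log(endLine);
            phantomApi.exit(1);
            return;
        }
        // Same behaviour as the original script on success.
        setTimeout(function () {
            log(page.content);
            log(endLine);
            phantomApi.exit(0);
        }, waitMs);
    };
}
```

Inside the script passed to RunScript, this would presumably be wired up with something like `attachFailureHandling(page, phantom, 'All output received', 5000, function (s) { console.log(s); });` before calling `page.open(url)`. The C# callback would then receive a string starting with `LOAD FAILED:` when the page cannot be loaded.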

Siderite Zackwehdex
  • A browser engine looks like a good idea; the default (IE8?) browser in C# is not the best choice for my project. Before I try out the WebKit engine, do you know if I can block all graphics on the website? I need to load the website as fast as I can. – Philipp Nies Mar 23 '16 at 10:50
  • As for the blocking, take a look at the onResourceRequested PhantomJS event. Maybe it has some sort of cancellation mechanism. However, keep in mind that the page might render differently depending on the size of the pictures. – Siderite Zackwehdex Mar 23 '16 at 11:38
  • I've tested a lot of jQuery web pages and it works great. Thanks a lot for your code example and the PhantomJS advice. – Philipp Nies Mar 25 '16 at 09:10
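For the image-blocking question in the comments above, a minimal sketch could look like the following. PhantomJS passes a `networkRequest` object to `onResourceRequested`, and that object has an `abort()` method; `page.settings.loadImages` is also a PhantomJS setting. The helper names `isImageUrl` and `blockImages` and the list of file extensions are assumptions made for illustration.

```javascript
// Hypothetical helper: guess from the URL whether a request is for an image.
// The extension list is an assumption; images served without a file
// extension will not be caught by this check.
function isImageUrl(url) {
    return /\.(png|jpe?g|gif|webp|svg|ico|bmp)([?#].*)?$/i.test(url);
}

// `page` is assumed to be a PhantomJS WebPage object,
// i.e. the result of require('webpage').create().
function blockImages(page) {
    // Ask PhantomJS not to load inline images at all.
    page.settings.loadImages = false;

    // Cancel any remaining request whose URL looks like an image.
    page.onResourceRequested = function (requestData, networkRequest) {
        if (isImageUrl(requestData.url)) {
            networkRequest.abort();
        }
    };
}
```

In the script from the answer, `blockImages(page);` would go right after the page is created and before `page.open(url)`. As noted in the comments, a page may lay out differently when its images are missing, so the rendered HTML is worth checking against a normal load.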
2

No, it's not possible with your example, since it only downloads the content as a string and never executes the page's scripts. You need to render that string in a "browser engine", or find a component that does that for you.

I would suggest looking into AbotX. They just announced this feature, so it may be of interest to you, but it is not free.

Vova Bilyachat