Is there a way to view the generated source of a web page (the code after all AJAX calls and JavaScript DOM manipulations have taken place) from a C# application without opening up a browser from the code?

Viewing the initial page using a WebRequest or WebClient object works ok, but if the page makes extensive use of JavaScript to alter the DOM on page load, then these don't provide an accurate picture of the page.
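
For reference, the "initial page" fetch described above can be sketched roughly as follows. The class and method names are hypothetical, chosen for illustration; only WebClient.DownloadString comes from the framework. The returned string is the markup exactly as the server sent it, so anything that JavaScript adds later will be missing.

```csharp
using System.Net;

public static class InitialSourceFetcher
{
    // Hypothetical helper: returns the raw HTML from the server.
    // No scripts run, so AJAX or DOM changes are not reflected.
    public static string GetInitialHtml(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}
```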

I have tried the Selenium and WatiN UI testing frameworks and they work perfectly, supplying the generated source as it appears after all JavaScript manipulations are completed. Unfortunately, they do this by opening up an actual web browser, which is very slow. I've implemented a Selenium server which offloads this work to another machine, but there is still a substantial delay.

Is there a .Net library that will load and parse a page (like a browser) and spit out the generated code? Clearly, Google and Yahoo aren't opening up browsers for every page they want to spider (of course they may have more resources than me...).

Is there such a library or am I out of luck unless I'm willing to dissect the source code of an open source browser?

SOLUTION

Well, thank you everyone for your help. I have a working solution that is about 10X faster than Selenium. Woo!

Thanks to this old article from beansoftware I was able to use the System.Windows.Forms.WebBrowser control to download the page and parse it, then give me the generated source. Even though the control is in Windows.Forms, you can still run it from ASP.NET (which is what I'm doing); just remember to add System.Windows.Forms to your project references.

There are two notable things about the code. First, the WebBrowser control is called in a new thread. This is because it must run in a single-threaded apartment (STA).

Second, the GeneratedSource variable is set in two places. This is not due to an intelligent design decision :) I'm still working on it and will update this answer when I'm done. wb_DocumentCompleted() is called multiple times. First when the initial HTML is downloaded, then again when the first round of JavaScript completes. Unfortunately, the site I'm scraping has 3 different loading stages. 1) Load initial HTML 2) Do first round of JavaScript DOM manipulation 3) pause for half a second then do a second round of JS DOM manipulation.

For some reason, the second round isn't caught by the wb_DocumentCompleted() function, but it is always caught when wb.ReadyState == Complete. So why not remove it from wb_DocumentCompleted()? I'm still not sure why it isn't caught there, and that's where the beansoftware article recommended putting it. I'm going to keep looking into it. I just wanted to publish this code so anyone who's interested can use it. Enjoy!

using System.Threading;
using System.Windows.Forms;

public class WebProcessor
{
    private string GeneratedSource{ get; set; }
    private string URL { get; set; }

    public string GetGeneratedHTML(string url)
    {
        URL = url;

        Thread t = new Thread(new ThreadStart(WebBrowserThread));
        t.SetApartmentState(ApartmentState.STA);
        t.Start();
        t.Join();

        return GeneratedSource;
    }

    private void WebBrowserThread()
    {
        WebBrowser wb = new WebBrowser();

        //Subscribe before navigating so no completion event is missed
        wb.DocumentCompleted += 
            new WebBrowserDocumentCompletedEventHandler(
                wb_DocumentCompleted);

        wb.Navigate(URL);

        while (wb.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        //Added this line, because the final HTML takes a while to show up
        GeneratedSource= wb.Document.Body.InnerHtml;

        wb.Dispose();
    }

    private void wb_DocumentCompleted(object sender, 
        WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser wb = (WebBrowser)sender;
        GeneratedSource= wb.Document.Body.InnerHtml;
    }
}
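
One possible way to cope with the multi-stage loading described above is to keep pumping messages until the body HTML has stopped changing for a short while. The following is only a sketch under stated assumptions, not the approach from the beansoftware article: StableSourceReader, QuietPeriod and Timeout are hypothetical names and values, Stopwatch.Restart needs .NET 4.0 or later, and a page that never stops mutating will simply run into the timeout.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

public class StableSourceReader
{
    // Hypothetical tuning values; adjust for the site being scraped.
    private static readonly TimeSpan QuietPeriod = TimeSpan.FromSeconds(1);
    private static readonly TimeSpan Timeout = TimeSpan.FromSeconds(30);

    public string GetGeneratedHtml(string url)
    {
        string result = null;
        var t = new Thread(() => result = Load(url));
        t.SetApartmentState(ApartmentState.STA); // WebBrowser needs an STA thread
        t.Start();
        t.Join();
        return result;
    }

    private static string Load(string url)
    {
        using (var wb = new WebBrowser())
        {
            wb.ScriptErrorsSuppressed = true; // avoid script error dialogs
            wb.Navigate(url);

            string last = null;
            var total = Stopwatch.StartNew();
            var quiet = Stopwatch.StartNew();

            while (total.Elapsed < Timeout)
            {
                Application.DoEvents(); // let the control load and run scripts
                Thread.Sleep(10);

                if (wb.ReadyState != WebBrowserReadyState.Complete ||
                    wb.Document == null || wb.Document.Body == null)
                {
                    quiet.Restart(); // still loading
                    continue;
                }

                string current = wb.Document.Body.InnerHtml;
                if (current != last)
                {
                    last = current;  // DOM changed; start the quiet period again
                    quiet.Restart();
                }
                else if (quiet.Elapsed >= QuietPeriod)
                {
                    break;           // unchanged long enough; treat as final
                }
            }
            return last; // null if the page never finished loading
        }
    }
}
```

With a quiet period longer than the half-second pause mentioned above, the second round of JS changes should be picked up, at the cost of a slightly longer wait per page.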
Kelly S. French
Michael La Voie
  • You could try to hack firebug sources. – Eugeniu Torica Aug 20 '09 at 18:08
  • My attempt would have been with Watin & friends as well. Great question! – orip Aug 20 '09 at 18:25
  • Try to run your code against "http://www.host.com/path/page.html?ast=3" or "http://gwt.google.com/samples/Showcase/Showcase.html". You will notice that it doesn't fetch the proper HTML. Any ideas how to fix that? – Cosmo Aug 15 '10 at 16:21

3 Answers

It is possible using an instance of a browser (in your case, the IE control). You can easily use it in your app and open a page. The control will then load the page and process any JavaScript. Once this is done, you can access the control's DOM object and get the "interpreted" code.

Niko
  • Wouldn't this still have the same speed problems as opening the browser? – Michael La Voie Aug 20 '09 at 18:46
  • Since you want your code to be interpreted and parsed, the speed "problem" would be pretty much the same (maybe a little less on CPU if you don't display the window, plus you have a little less overhead). As far as I remember, you can also prevent the control from loading images, thus reducing the load time even more. But that's the only way you can accomplish what you want, I am afraid – Niko Aug 20 '09 at 19:00
  • Thanks for your help. I posted my final answer, but yours was what sent me in that direction. :D – Michael La Voie Aug 20 '09 at 23:33

The best way is to use PhantomJS; it works great. (There is a sample in this article.)

My solution looks like this:

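var page = require('webpage').create();

page.open("https://sample.com", function(){
    // jsonData is assumed to be defined earlier in the PhantomJS script.
    // It is passed in explicitly because page.evaluate() runs sandboxed
    // inside the page and cannot see variables from this script.
    page.evaluate(function(oJson){
        localStorage.clear();

        Object.keys(oJson).forEach(function(sKey){
            localStorage.setItem(sKey, oJson[sKey]);
        });
    }, jsonData);

    page.open("https://sample.com", function(){
        // Give the page ten seconds to finish its JavaScript work
        setTimeout(function(){
            // Where you want to save it
            page.render("screenshoot.png");
            console.log(page.content); // page source
            // You can access its content using jQuery (if the page loads it).
            // Return a string: evaluate() only passes back serializable values.
            var fbcomments = page.evaluate(function(){
                return $("body").contents().find(".content").html();
            });
            phantom.exit();
        }, 10000);
    });
});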
Community

Theoretically yes, but, at present, no.

I don't think there is currently a product or OSS project that does this. Such a product would need to have its own JavaScript interpreter and be able to accurately emulate the run-time environment and quirks of every browser it supports.

Given that you need something that accurately emulates the server + browser environment in order to produce the final page code, in the long run, I think that using a browser instance is the best way to accurately generate the page in its final state. This is especially true, when you consider that, after the page load completes, the page sources can still change over time in the browser from AJAX/javascript.

Jeff Leonard
  • You may be right, and thanks for the thought. I did find a Java library that may be what I need, but I'm still hoping for a .net solution. Surely someone else has needed this before me: http://stackoverflow.com/questions/857515/screen-scraping-from-a-web-page-with-a-lot-of-javascript/857630#857630 – Michael La Voie Aug 20 '09 at 18:32