I am using WebBrowser to render JavaScript on web pages so that I can scrape the rendered source code, but after several page loads both the CPU usage (up to 100%) and the number of threads spike.

I'm assuming that the threads are not closing properly once the webpage has been rendered. I am trying to open the browser, extract the source code, and then close the browser and move to the next page.

I am able to get the rendered page, but this program doesn't make it very far before getting bogged down. I tried adding wb.Stop() but that didn't help. The memory doesn't seem to be the problem (stays at a constant 70% or so).

Here is my source code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Threading;

namespace Abot.Demo
{
    // Threaded version
    public class HeadlessBrowser
    {
        private static string GeneratedSource { get; set; }
        private static string URL { get; set; }

        public static string GetGeneratedHTML(string url)
        {
            URL = url;

            Thread t = new Thread(new ThreadStart(WebBrowserThread));
            t.SetApartmentState(ApartmentState.STA);
            t.Start();
            t.Join();

            return GeneratedSource;
        }

        private static void WebBrowserThread()
        {
            WebBrowser wb = new WebBrowser();
            wb.Navigate(URL);

            wb.DocumentCompleted +=
                new WebBrowserDocumentCompletedEventHandler(
                    wb_DocumentCompleted);

            while (wb.ReadyState != WebBrowserReadyState.Complete);
                //Application.DoEvents();

            //Added this line, because the final HTML takes a while to show up
            GeneratedSource = wb.Document.Body.InnerHtml;

            wb.Dispose();
            wb.Stop();
        }

        private static void wb_DocumentCompleted(object sender,
            WebBrowserDocumentCompletedEventArgs e)
        {
            WebBrowser wb = (WebBrowser)sender;
            GeneratedSource = wb.Document.Body.InnerHtml;
        }

    }
}

Any suggestions would be appreciated.

Thanks.

Pankaj Makwana
JBaczuk
  • When a web browser turns my PC into a toaster it's usually the fault of runaway JavaScript. (Some Flash plug-ins have been culprits too.) Make sure your JS is doing what it needs to by stopping when you abandon the page. – Paul Sasik May 20 '14 at 20:34
  • Is this a WPF application, a winform application, an application with no UI whatsoever, or what? – Servy May 20 '14 at 20:37
  • @Servy: I see `using System.Windows.Forms;` at the top of the file so unless the OP did something weird with the project it's probably WinForms. – Paul Sasik May 20 '14 at 20:44
  • @PaulSasik That is to access the `WebBrowser` class itself. It's a question of whether this is the only winform code in his project or whether this is one small piece of an actual windows forms application. – Servy May 20 '14 at 20:45
  • I am testing it by running it in a console, but I am actually trying to run it in an MVC project. – JBaczuk May 20 '14 at 20:47

1 Answer

WebBrowser is designed to be used from inside a Windows Forms application, not from outside one.

Among other things, it relies on an application (message) loop, which exists in pretty much any desktop GUI application. You don't have one, and that is causing problems for you because the browser depends on it for its event-based style of programming.

A quick word to any future readers who are actually creating a WinForms, WPF, or other application that already has a message loop: do not apply the following code. You should only ever have one message loop in your application. Creating several is setting yourself up for a nightmare.

Since you have no application loop, you need to create one, specify some code to run within it, allow it to pump messages, and then tear it down once you have your result.

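public static string GetGeneratedHTML(string url)
{
    string result = null;
    ThreadStart pumpMessages = () =>
    {
        // Defer the browser work until the message loop is running and idle.
        EventHandler idleHandler = null;
        idleHandler = (s, e) =>
        {
            // Unsubscribe so this setup runs only once.
            Application.Idle -= idleHandler;

            WebBrowser wb = new WebBrowser();
            wb.DocumentCompleted += (s2, e2) =>
            {
                // Capture the rendered HTML, then shut the message loop down.
                result = wb.Document.Body.InnerHtml;
                wb.Dispose();
                Application.Exit();
            };
            wb.Navigate(url);
        };
        Application.Idle += idleHandler;
        // Pump messages on this thread until Application.Exit() is called.
        Application.Run();
    };
    // WebBrowser needs a single-threaded apartment (STA) thread.
    if (Thread.CurrentThread.GetApartmentState() == ApartmentState.STA)
        pumpMessages();
    else
    {
        // Run the message loop on a dedicated STA thread and wait for it.
        Thread t = new Thread(pumpMessages);
        t.SetApartmentState(ApartmentState.STA);
        t.Start();
        t.Join();
    }
    return result;
}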
Servy
  • Thanks Servy, I wasn't aware of that, this seems to be working now. – JBaczuk May 20 '14 at 21:26
  • @Servy, I get an error at `Application.Run()`: `An unhandled exception of type 'System.InvalidOperationException' occurred in System.Windows.Forms.dll Additional information: Starting a second message loop on a single thread is not a valid operation. Use Form.ShowDialog instead.` How can I fix this? – Root Oct 13 '17 at 16:56
  • @Root125 If you already have a running message loop you have no reason to create a second one. You just need to create the WebBrowser and use it the way a WebBrowser is typically used. – Servy Oct 16 '17 at 13:09
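
For readers in the situation described in the last two comments (an application that already has a message loop), here is a minimal sketch of one possible approach. It is not from the answer: `BrowserExtensions` and `NavigateAndGetHtmlAsync` are hypothetical names, and the sketch assumes the method is called on the UI thread that owns the existing loop. It wraps `DocumentCompleted` in a `Task` instead of starting a second `Application.Run()`.

```csharp
using System.Threading.Tasks;
using System.Windows.Forms;

// Hypothetical helper; the class and method names are illustrative only.
public static class BrowserExtensions
{
    // Assumption: called on the UI thread that already pumps messages.
    public static Task<string> NavigateAndGetHtmlAsync(this WebBrowser wb, string url)
    {
        var tcs = new TaskCompletionSource<string>();
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (s, e) =>
        {
            // DocumentCompleted can fire once per frame, so wait until the
            // control itself reports that loading has finished.
            if (wb.ReadyState != WebBrowserReadyState.Complete)
                return;

            wb.DocumentCompleted -= handler;
            HtmlElement body = wb.Document != null ? wb.Document.Body : null;
            tcs.TrySetResult(body != null ? body.InnerHtml : null);
        };
        wb.DocumentCompleted += handler;
        wb.Navigate(url);
        return tcs.Task;
    }
}
```

From an `async` event handler on a form this could be used as `string html = await webBrowser1.NavigateAndGetHtmlAsync(url);`, where `webBrowser1` is an assumed control on that form. Nothing blocks or spins while the page loads, so the existing message loop keeps running.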