14

I am using this method to instantiate a web browser programmatically, navigate to a URL, and return a result when the document has finished loading.

How would I be able to stop the Task and have GetFinalUrl() return null if the document takes more than 5 seconds to load?

I have seen many examples using a TaskFactory but I haven't been able to apply it to this code.

    private Uri GetFinalUrl(PortalMerchant portalMerchant)
    {
        SetBrowserFeatureControl();
        Uri finalUri = null;
        if (string.IsNullOrEmpty(portalMerchant.Url))
        {
            return null;
        }
        Uri trackingUrl = new Uri(portalMerchant.Url);
        var task = MessageLoopWorker.Run(DoWorkAsync, trackingUrl);
        task.Wait();
        if (!String.IsNullOrEmpty(task.Result.ToString()))
        {
            return new Uri(task.Result.ToString());
        }
        else
        {
            throw new Exception("Parsing Failed");
        }
    }

// by Noseratio - http://stackoverflow.com/users/1768303/noseratio    

static async Task<object> DoWorkAsync(object[] args)
{
    _threadCount++;
    Console.WriteLine("Thread count:" + _threadCount);
    Uri retVal = null;
    var wb = new WebBrowser();
    wb.ScriptErrorsSuppressed = true;

    TaskCompletionSource<bool> tcs = null;
    WebBrowserDocumentCompletedEventHandler documentCompletedHandler = (s, e) => tcs.TrySetResult(true);

    foreach (var url in args)
    {
        tcs = new TaskCompletionSource<bool>();
        wb.DocumentCompleted += documentCompletedHandler;
        try
        {
            wb.Navigate(url.ToString());
            await tcs.Task;
        }
        finally
        {
            wb.DocumentCompleted -= documentCompletedHandler;
        }

        retVal = wb.Url;
        wb.Dispose();
        return retVal;
    }
    return null;
}

public static class MessageLoopWorker
{
    #region Public static methods

    public static async Task<object> Run(Func<object[], Task<object>> worker, params object[] args)
    {
        var tcs = new TaskCompletionSource<object>();

        var thread = new Thread(() =>
        {
            EventHandler idleHandler = null;

            idleHandler = async (s, e) =>
            {
                // handle Application.Idle just once
                Application.Idle -= idleHandler;

                // return to the message loop
                await Task.Yield();

                // and continue asynchronously
                // propagate the result or exception
                try
                {
                    var result = await worker(args);
                    tcs.SetResult(result);
                }
                catch (Exception ex)
                {
                    tcs.SetException(ex);
                }

                // signal to exit the message loop
                // Application.Run will exit at this point
                Application.ExitThread();
            };

            // handle Application.Idle just once
            // to make sure we're inside the message loop
            // and SynchronizationContext has been correctly installed
            Application.Idle += idleHandler;
            Application.Run();
        });

        // set STA model for the new thread
        thread.SetApartmentState(ApartmentState.STA);

        // start the thread and await for the task
        thread.Start();
        try
        {
            return await tcs.Task;
        }
        finally
        {
            thread.Join();
        }
    }
    #endregion
}
noseratio
Dan Cook
  • Nice to see someone is actually using [this code](http://stackoverflow.com/a/19737374/1768303) :) I have another example that does a similar thing with a timeout: http://stackoverflow.com/a/21152965/1768303. Look for `var cts = new CancellationTokenSource(30000)`. – noseratio Mar 07 '14 at 01:40
  • Thanks. Do you have an example of how to do this in a console app, by any chance? Also, I don't think the WebBrowser can be a class variable, because I am running the whole thing in a parallel foreach, iterating over thousands of URLs. – Dan Cook Mar 07 '14 at 18:30
  • I used the code you suggested in my console app and got: System.Threading.ThreadStateException: ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment. Which I guess is what the message loop worker thread does in your other code sample. Which is what I could not get working with the cancellationToken. Help appreciated. I will keep trying. – Dan Cook Mar 07 '14 at 18:52
  • It seems like not only does it need to be run on an STA thread but also needs a message loop worker as at: http://stackoverflow.com/a/19737374/1768303 – Dan Cook Mar 07 '14 at 20:43

3 Answers

25

Updated: the latest version of the WebBrowser-based console web scraper can be found on GitHub.

Updated: Adding a pool of WebBrowser objects for multiple parallel downloads.

Do you have an example of how to do this in a console app, by any chance? Also, I don't think the WebBrowser can be a class variable, because I am running the whole thing in a parallel foreach, iterating over thousands of URLs.

Below is an implementation of a more or less generic **WebBrowser-based web scraper**, which works as a console application. It's a consolidation of some of my previous WebBrowser-related efforts, including the code referenced in the question.

A few points:

  • The reusable MessageLoopApartment class starts and runs a WinForms STA thread with its own message pump. It can be used from a console application, as below. The class exposes a TPL task scheduler (obtained via FromCurrentSynchronizationContext) and a set of Task.Factory.StartNew wrappers that use this scheduler.

  • This makes async/await a great tool for running WebBrowser navigation tasks on that separate STA thread. A WebBrowser object gets created, navigated and destroyed on that thread. That said, MessageLoopApartment is not tied to WebBrowser specifically.

  • It's important to enable HTML5 rendering using Browser Feature Control, as otherwise WebBrowser objects run in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.

  • It may not always be possible to determine with certainty when a web page has finished rendering. Some pages are quite complex and use continuous AJAX updates. Yet we can get quite close by handling the DocumentCompleted event first, then polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what NavigateAsync does below.

  • Timeout logic is layered on top of the above, in case the page rendering never ends (note CancellationTokenSource and CreateLinkedTokenSource).

using Microsoft.Win32;
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace Console_22239357
{
    class Program
    {
        // by Noseratio - https://stackoverflow.com/a/22262976/1768303

        // main logic
        static async Task ScrapeSitesAsync(string[] urls, CancellationToken token)
        {
            using (var apartment = new MessageLoopApartment())
            {
                // create WebBrowser inside MessageLoopApartment
                var webBrowser = apartment.Invoke(() => new WebBrowser());
                try
                {
                    foreach (var url in urls)
                    {
                        Console.WriteLine("URL:\n" + url);

                        // cancel in 30s or when the main token is signalled;
                        // the linked source is disposed once this URL is done
                        using (var navigationCts = CancellationTokenSource.CreateLinkedTokenSource(token))
                        {
                            navigationCts.CancelAfter(TimeSpan.FromSeconds(30));
                            var navigationToken = navigationCts.Token;

                            // run the navigation task inside MessageLoopApartment
                            string html = await apartment.Run(() =>
                                webBrowser.NavigateAsync(url, navigationToken), navigationToken);

                            Console.WriteLine("HTML:\n" + html);
                        }
                    }
                }
                finally
                {
                    // dispose of WebBrowser inside MessageLoopApartment
                    apartment.Invoke(() => webBrowser.Dispose());
                }
            }
        }

        // entry point
        static void Main(string[] args)
        {
            try
            {
                WebBrowserExt.SetFeatureBrowserEmulation(); // enable HTML5

                var cts = new CancellationTokenSource((int)TimeSpan.FromMinutes(3).TotalMilliseconds);

                var task = ScrapeSitesAsync(
                    new[] { "http://example.com", "http://example.org", "http://example.net" },
                    cts.Token);

                task.Wait();

                Console.WriteLine("Press Enter to exit...");
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                while (ex is AggregateException && ex.InnerException != null)
                    ex = ex.InnerException;
                Console.WriteLine(ex.Message);
                Environment.Exit(-1);
            }
        }
    }

    /// <summary>
    /// WebBrowserExt - WebBrowser extensions
    /// by Noseratio - https://stackoverflow.com/a/22262976/1768303
    /// </summary>
    public static class WebBrowserExt
    {
        const int POLL_DELAY = 500;

        // navigate and download 
        public static async Task<string> NavigateAsync(this WebBrowser webBrowser, string url, CancellationToken token)
        {
            // navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                webBrowser.DocumentCompleted += handler;
                try
                {
                    webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    webBrowser.DocumentCompleted -= handler;
                }
            }

            // get the root element
            var documentElement = webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(POLL_DELAY, token);

                // continue polling if the WebBrowser is still busy
                if (webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered 
            token.ThrowIfCancellationRequested();
            return html;
        }

        // enable HTML5 (assuming we're running IE10+)
        // more info: https://stackoverflow.com/a/18333982/1768303
        public static void SetFeatureBrowserEmulation()
        {
            if (System.ComponentModel.LicenseManager.UsageMode != System.ComponentModel.LicenseUsageMode.Runtime)
                return;
            var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
            Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION",
                appName, 10000, RegistryValueKind.DWord);
        }
    }

    /// <summary>
    /// MessageLoopApartment
    /// STA thread with message pump for serial execution of tasks
    /// by Noseratio - https://stackoverflow.com/a/22262976/1768303
    /// </summary>
    public class MessageLoopApartment : IDisposable
    {
        Thread _thread; // the STA thread

        TaskScheduler _taskScheduler; // the STA thread's task scheduler

        public TaskScheduler TaskScheduler { get { return _taskScheduler; } }

        /// <summary>MessageLoopApartment constructor</summary>
        public MessageLoopApartment()
        {
            var tcs = new TaskCompletionSource<TaskScheduler>();

            // start an STA thread and get its task scheduler
            _thread = new Thread(startArg =>
            {
                EventHandler idleHandler = null;

                idleHandler = (s, e) =>
                {
                    // handle Application.Idle just once
                    Application.Idle -= idleHandler;
                    // return the task scheduler
                    tcs.SetResult(TaskScheduler.FromCurrentSynchronizationContext());
                };

                // handle Application.Idle just once
                // to make sure we're inside the message loop
                // and SynchronizationContext has been correctly installed
                Application.Idle += idleHandler;
                Application.Run();
            });

            _thread.SetApartmentState(ApartmentState.STA);
            _thread.IsBackground = true;
            _thread.Start();
            _taskScheduler = tcs.Task.Result;
        }

        /// <summary>shutdown the STA thread</summary>
        public void Dispose()
        {
            if (_taskScheduler != null)
            {
                var taskScheduler = _taskScheduler;
                _taskScheduler = null;

                // execute Application.ExitThread() on the STA thread
                Task.Factory.StartNew(
                    () => Application.ExitThread(),
                    CancellationToken.None,
                    TaskCreationOptions.None,
                    taskScheduler).Wait();

                _thread.Join();
                _thread = null;
            }
        }

        /// <summary>Task.Factory.StartNew wrappers</summary>
        public void Invoke(Action action)
        {
            Task.Factory.StartNew(action,
                CancellationToken.None, TaskCreationOptions.None, _taskScheduler).Wait();
        }

        public TResult Invoke<TResult>(Func<TResult> action)
        {
            return Task.Factory.StartNew(action,
                CancellationToken.None, TaskCreationOptions.None, _taskScheduler).Result;
        }

        public Task Run(Action action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler);
        }

        public Task<TResult> Run<TResult>(Func<TResult> action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler);
        }

        public Task Run(Func<Task> action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler).Unwrap();
        }

        public Task<TResult> Run<TResult>(Func<Task<TResult>> action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler).Unwrap();
        }
    }
}
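For the parallel-download case mentioned in the update at the top, one possible way to cap the number of navigations in flight is `SemaphoreSlim.WaitAsync`. The following is only a sketch and not part of the scraper above: `Throttle` and `ForEachAsync` are hypothetical names, and the `work` delegate stands in for whatever asynchronous operation is used per URL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (an assumption, not part of the scraper above).
public static class Throttle
{
    // Runs `work` for every item with at most `maxConcurrency`
    // operations in flight; results come back in input order.
    public static async Task<TResult[]> ForEachAsync<TItem, TResult>(
        IEnumerable<TItem> items,
        int maxConcurrency,
        Func<TItem, Task<TResult>> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                // wait for a free slot without blocking a thread
                await gate.WaitAsync();
                try
                {
                    return await work(item);
                }
                finally
                {
                    gate.Release(); // free the slot even if work throws
                }
            }).ToList(); // materialize so each operation starts exactly once

            // completes only when every operation has finished,
            // so disposing the semaphore afterwards is safe
            return await Task.WhenAll(tasks);
        }
    }
}
```

A call might look like `var pages = await Throttle.ForEachAsync(urls, 3, url => ScrapeOneAsync(url));`, where `ScrapeOneAsync` is a placeholder for a method that navigates one URL. Unlike a blocking `Parallel.ForEach`, no thread is held while a navigation is pending.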
noseratio
  • Thank you Noseratio. This code worked perfectly as-is and I could adapt it to fit my needs easily. I am using it in a parallel foreach and it is very stable. If you need to parse multiple URLs in a console app using a web browser, look no further. Thank you! – Dan Cook Mar 09 '14 at 23:32
  • @DanCook, no worries, glad it helped. If you're doing this in parallel, just make sure to limit the number of `WebBrowser` instances to a reasonable figure, like 3-4. You could use `SemaphoreSlim.WaitAsync` for this (there are a lot of usage examples here on SO). Another thing to keep in mind: all `WebBrowser` instances share the same HTTP session (including cookies). – noseratio Mar 09 '14 at 23:52
  • Parallel.ForEach(myList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, myItem => This should keep the WebBrowser instances to a max of 20, right? The production server has decent RAM and an SSD, so hopefully 20 will be OK. Session is irrelevant in my case, but that's a good tip for anyone else. – Dan Cook Mar 09 '14 at 23:55
  • @DanCook, that's not how I would do it. You only need a thread to process the result, i.e., when `webBrowser.NavigateAsync` has completed. `Task.Run` would be a great fit for that. Otherwise, blocking a thread while `webBrowser.NavigateAsync` is "in-flight" is a bad idea. If interested, post a separate question, and I'll show what I mean with the code, as time allows. – noseratio Mar 10 '14 at 00:01
  • I sent an email to your Gmail with my class as an attachment since, without further discussion, I am not sure what my separate question would be. In my email I suggested that, if you are able to assist me at all, we could post the final question and solution on here and link to it. You are a TPL god – Dan Cook Mar 10 '14 at 00:38
  • @DanCook, thanks for the nice words, but I'm really not a TPL expert, I'm just starting to fully embrace it. [Stephen Cleary](http://blog.stephencleary.com/) is, and so is [Stephen Toub](http://blogs.msdn.com/pfxteam), the TPL architect. Their blogs are must-reads, multiple times until fully understood. Regarding the question, just ask it here on SO, like how to organize a `WebBrowser`-based crawler with a certain degree of parallelism. Include a link to this question and tag it with [webbrowser-control]. I'll get to it :) – noseratio Mar 10 '14 at 00:47
  • I used Stephen Toub's example from here: http://blogs.msdn.com/b/pfxteam/archive/2011/11/10/10235834.aspx to attempt the task. The Rx-based solution was also very interesting. Actually, the parallel foreach with MaxDegreeOfParallelism seems to be working OK. 15000 records parsed, 20 concurrently, and it hasn't crashed (yet?). Let's close this question, but feel free to reply to my email still if it's interesting to you. Kudos – Dan Cook Mar 10 '14 at 00:56
  • @Noseratio I've been threading 300-400 Webbrowsers in `Delphi` for ages now. With a custom multithreaded backend for each browser. So they are isolated from each other (Cookies, Proxy) the whole shebang! :) – user3655788 May 21 '14 at 08:45
  • @user3655788, glad you have, but I see no point in having 300-400 web browsers for what pretty much is a bunch of IO-bound download operations. The sensible number really depends on the throughput capacity of your internet connection. – noseratio May 21 '14 at 08:55
  • @Noseratio No they all share a common thread pool based on `IOCP` so its very very efficient. – user3655788 May 21 '14 at 08:56
  • @user3655788, whether `WebBrowser` downloads use IOCP without blocking OS threads is not a 100% fact, to my knowledge. I'd be interested if you can prove that, but that's not the point. The point is, if you start 400 downloads in parallel and your inbound internet connection speed is 10 Mbps, each download will be crawling at about 25 Kbps, which is a dial-up modem speed. – noseratio May 21 '14 at 09:11
  • @Noseratio I've implemented bandwidth limitation. But on a 100mbit connection 400 browsers download in 3-6 seconds. Using a custom IOCP backend. – user3655788 May 21 '14 at 12:22
  • @Noseratio, your code worked in a console application. I tried almost your exact code in a WinForms application, but the only output I got is `URL: http://example.com`. What is it about WinForms that causes an issue? Differences: I created a new class for your code, `Program2`. I added a button to a form, and the button calls `Program2.Start(new string[1]);`. `Start` in my class is what replaces `Main` in yours. I also tried another version where I use the default `public partial class Form1 : Form`, replacing your `Main` with a `button1_Click` that contains the body of your `Main`. No luck. Ideas? – Jacob Quisenberry Jun 14 '14 at 04:17
4

I suspect running a processing loop on another thread will not work out well, since WebBrowser is a UI component that hosts an ActiveX control.

When you're writing TAP over EAP wrappers, I recommend using extension methods to keep the code clean:

public static Task<string> NavigateAsync(this WebBrowser @this, string url)
{
  var tcs = new TaskCompletionSource<string>();
  WebBrowserDocumentCompletedEventHandler subscription = null;
  subscription = (_, args) =>
  {
    @this.DocumentCompleted -= subscription;
    tcs.TrySetResult(args.Url.ToString());
  };
  @this.DocumentCompleted += subscription;
  @this.Navigate(url);
  return tcs.Task;
}

Now your code can easily apply a timeout:

async Task<string> GetUrlAsync(string url)
{
  using (var wb = new WebBrowser())
  {
    var navigate = wb.NavigateAsync(url);
    var timeout = Task.Delay(TimeSpan.FromSeconds(5));
    var completed = await Task.WhenAny(navigate, timeout);
    if (completed == navigate)
      return await navigate;
    return null;
  }
}

which can be consumed as such:

private async Task<Uri> GetFinalUrlAsync(PortalMerchant portalMerchant)
{
  SetBrowserFeatureControl();
  if (string.IsNullOrEmpty(portalMerchant.Url))
    return null;
  var result = await GetUrlAsync(portalMerchant.Url);
  if (!String.IsNullOrEmpty(result))
    return new Uri(result);
  throw new Exception("Parsing Failed");
}
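The `Task.WhenAny` timeout used in `GetUrlAsync` can also be factored into a reusable helper. This is a minimal sketch, not part of the answer's code: `TaskTimeoutExtensions` and `OrDefaultAfter` are hypothetical names. It also cancels the delay once the real task wins, so the timer does not linger. Note that timing out only stops waiting; the underlying operation keeps running unless something else (such as disposing the `WebBrowser`) ends it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (an assumption, not part of the answer above).
public static class TaskTimeoutExtensions
{
    // Returns the task's result, or default(T) (null for reference
    // types) if the task has not completed within `timeout`.
    public static async Task<T> OrDefaultAfter<T>(this Task<T> task, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource())
        {
            var delay = Task.Delay(timeout, cts.Token);
            var completed = await Task.WhenAny(task, delay);
            if (completed != task)
                return default(T); // timed out: stop waiting

            cts.Cancel();      // the task won, so release the timer
            return await task; // propagate the result or exception
        }
    }
}
```

With such a helper, the body of `GetUrlAsync` could reduce to `return await wb.NavigateAsync(url).OrDefaultAfter(TimeSpan.FromSeconds(5));` inside the existing `using` block.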
Stephen Cleary
  • Thanks, I tried your solution, but the web browser has to be used on an STA thread and have a message loop worker (like in my original code, which is Noseratio's). I don't know how to factor this into your solution. – Dan Cook Mar 07 '14 at 21:54
  • The code I wrote is intended to be called from the UI thread. It's possible to create a separate STA thread, but I wouldn't unless it was really necessary. – Stephen Cleary Mar 07 '14 at 22:03
  • WebBrowser must be run on an STA thread because of the way ActiveX works. Very much appreciate your answer. For anyone who doesn't need to use a web browser, this does work; I tested it. – Dan Cook Mar 09 '14 at 23:56
  • I know this is pretty old, but I'm not able to avoid a 'not all code paths return a value' from the static Task NavigateAsync code. To avoid another issue I added 'ToString()' on the line with the 'TrySetResult'. Thanks for your help. – Jerome Jan 29 '18 at 10:54
  • @Stephen: Thanks Stephen. This is now what I read in your blog: https://blog.stephencleary.com/2012/02/async-and-await.html – Jerome Jan 29 '18 at 14:54
  • Sorry, I had to use the answer (I know it's not) but I don't know how to address my question directly linked to your inputs here. – Jerome Jan 30 '18 at 14:49
-1

I'm trying to benefit from Noseratio's solution while also following Stephen Cleary's advice.

Here is my updated code: Stephen's code, extended with Noseratio's polling technique for AJAX pages.

First part: the NavigateAsync extension method suggested by Stephen

public static Task<string> NavigateAsync(this WebBrowser @this, string url)
{
  var tcs = new TaskCompletionSource<string>();
  WebBrowserDocumentCompletedEventHandler subscription = null;
  subscription = (_, args) =>
  {
    @this.DocumentCompleted -= subscription;
    tcs.TrySetResult(args.Url.ToString());
  };
  @this.DocumentCompleted += subscription;
  @this.Navigate(url);
  return tcs.Task;
}

Second part: a new method, NavAjaxAsync, that applies the AJAX polling technique (based on Noseratio's code)

public static async Task<string> NavAjaxAsync(this WebBrowser @this)
{
  // get the root element
  var documentElement = @this.Document.GetElementsByTagName("html")[0];

  // poll the current HTML for changes asynchronously
  var html = documentElement.OuterHtml;

  while (true)
  {
    // wait asynchronously
    await Task.Delay(POLL_DELAY);

    // continue polling if the WebBrowser is still busy
    if (@this.IsBusy)
      continue;

    var htmlNow = documentElement.OuterHtml;
    if (html == htmlNow)
      break; // no changes detected, end the poll loop

    html = htmlNow;
  }

  return @this.Document.Url.ToString();
}

Third part: a new method, NavAndAjaxAsync, that combines the navigation and the AJAX polling

public static async Task<string> NavAndAjaxAsync(this WebBrowser @this, string url)
{
  await @this.NavigateAsync(url);
  // return the final URL so that callers can await a string result
  return await @this.NavAjaxAsync();
}

Fourth and last part: Stephen's GetUrlAsync, updated to use Noseratio's AJAX polling

async Task<string> GetUrlAsync(string url)
{
  using (var wb = new WebBrowser())
  {
    var navigate = wb.NavAndAjaxAsync(url);
    var timeout = Task.Delay(TimeSpan.FromSeconds(5));
    var completed = await Task.WhenAny(navigate, timeout);
    if (completed == navigate)
      return await navigate;
    return null;
  }
}

I'd like to know if this is the right approach.

Jerome