I need to create a method in a class library to get the content of a URL (which may be dynamically populated by JavaScript).

I'm at a loss, but after googling for the whole day, this is what I came up with (most of the code is from here):

using System;
using System.Threading.Tasks;
using System.Threading;
using System.Windows.Forms;

public static class WebScraper
{
    [STAThread]
    public async static Task<string> LoadDynamicPage(string url, CancellationToken token)
    {
        using (WebBrowser webBrowser = new WebBrowser())
        {
            // Navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler onDocumentComplete = (s, arg) => tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                webBrowser.DocumentCompleted += onDocumentComplete;
                try
                {
                    webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    webBrowser.DocumentCompleted -= onDocumentComplete;
                }
            }

            // get the root element
            var documentElement = webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronously
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(500, token);

                // continue polling if the WebBrowser is still busy
                if (webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered 
            token.ThrowIfCancellationRequested();
            return html;
        }
    }
}

It currently throws this error:

ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment.

Am I close? Is there a fix for the above?

Or, if I am off track, is there a ready-made solution for getting dynamic web content using .NET (one that can be called from a class library)?

Aximili

1 Answer

Here is what I tested in a web application; it worked properly.

It runs a WebBrowser control on a separate STA thread and returns a Task<string> that completes once the browser has finished loading the content:

using System;
using System.Threading.Tasks;
using System.Threading;
using System.Windows.Forms;
public class BrowserBasedWebScraper
{
    public static Task<string> LoadUrl(string url)
    {
        var tcs = new TaskCompletionSource<string>();
        // WebBrowser wraps an ActiveX control, so it must be created on an STA thread.
        Thread thread = new Thread(() => {
            try {
                Func<string> f = () => {
                    using (WebBrowser browser = new WebBrowser())
                    {
                        browser.ScriptErrorsSuppressed = true;
                        browser.Navigate(url);
                        // Pump the message loop until the document has finished loading.
                        while (browser.ReadyState != WebBrowserReadyState.Complete)
                        {
                            System.Windows.Forms.Application.DoEvents();
                        }
                        return browser.DocumentText;
                    }
                };
                tcs.SetResult(f());
            }
            catch (Exception e) {
                // Surface any failure to the caller through the returned task.
                tcs.SetException(e);
            }
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        return tcs.Task;
    }
}
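
Because the loop above only ends when `ReadyState` reaches `Complete`, a caller may want to stop waiting after a while. The following is a minimal sketch of one way to do that from the calling side, assuming the `BrowserBasedWebScraper` class above is in scope. The helper name `LoadUrlWithTimeout` and its `timeout` parameter are hypothetical and not part of the tested code.

```csharp
using System;
using System.Threading.Tasks;

public static class BrowserBasedWebScraperExtras
{
    // Hypothetical helper: waits for LoadUrl, but gives up after the given timeout.
    public static async Task<string> LoadUrlWithTimeout(string url, TimeSpan timeout)
    {
        Task<string> load = BrowserBasedWebScraper.LoadUrl(url);

        // Task.WhenAny returns whichever task finishes first.
        Task finished = await Task.WhenAny(load, Task.Delay(timeout));
        if (finished != load)
            throw new TimeoutException("The page did not finish loading in time.");

        // Awaiting the completed task returns its result or rethrows its exception.
        return await load;
    }
}
```

Note that this only stops the caller from waiting. The background thread keeps running until the page completes or the process exits, so it is not a true cancellation.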
Reza Aghaei
  • Thank you! It doesn't work with https://www.google.com/#q=where+am+i but it may be enough for what I need now – Aximili Oct 15 '16 at 09:44
  • You're welcome. As for the other issue, I guess it's because the `WebBrowser` control doesn't use the latest version of your browser by default. You can force it to use the latest version. I've applied [the solution](http://stackoverflow.com/a/38514446/3110834) in a Windows Forms application. – Reza Aghaei Oct 15 '16 at 11:22
  • `System.Windows.Forms.Application.DoEvents` does not seem to be the key. The issue is still there, especially with heavy Ajax requests, which may imply that the solution is on the Ajax request code side (see https://codesave.wordpress.com/2013/09/25/ajax-call-freezes-ui-animation-locked-ui-during-ajax-call/). – Jerome Nov 10 '20 at 11:00