I am trying to use a loop to download a batch of HTML pages and scrape data from them. Those pages run some JavaScript while loading, so I think WebClient may not be a good choice. But if I use WebBrowser as below, it returns an empty HTML string after the first call in the loop.

WebBrowser wb = new WebBrowser();
wb.ScrollBarsEnabled = false;
wb.ScriptErrorsSuppressed = true;
wb.Navigate(url);
while (wb.ReadyState != WebBrowserReadyState.Complete) { Application.DoEvents(); Thread.Sleep(1000); }
html = wb.Document.DomDocument.ToString();
Mike Long

1 Answer

You are correct that WebClient and all of the other HTTP client interfaces will completely ignore JavaScript; none of them are browsers, after all.

You want:

var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;

Note that if you load via a WebBrowser you don't need to scrape the raw markup; you can use DOM methods like GetElementById, GetElementsByTagName and so on.

The polling while loop is very VBScript; there is a DocumentCompleted event you should wire your code into instead.


private void Whatever()
{
    WebBrowser wb = new WebBrowser();
    wb.DocumentCompleted += Wb_DocumentCompleted;

    wb.ScriptErrorsSuppressed = true;
    wb.Navigate("http://stackoverflow.com");
}

private void Wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var wb = (WebBrowser)sender;

    var html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
    var domd = wb.Document.GetElementById("copyright").InnerText;
    /* ... */
}
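
Since the question is about a loop over many pages, the same event can drive the loop: navigate to the next URL only after the previous page has completed. The following is a minimal sketch, not part of the original answer; the class name `PageQueue` and its `PageLoaded` event are hypothetical, and it assumes the code runs on a UI (STA) thread with a message loop, as in a Windows Forms application. The `ReadyState` check is a common heuristic for skipping per-frame completions rather than a guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical helper (not from the answer): loads URLs one at a time.
public sealed class PageQueue
{
    private readonly Queue<string> _urls;
    private readonly WebBrowser _wb = new WebBrowser { ScriptErrorsSuppressed = true };

    // Raised with (url, html) once each page has finished loading.
    public event Action<string, string> PageLoaded;

    public PageQueue(IEnumerable<string> urls)
    {
        _urls = new Queue<string>(urls);
        _wb.DocumentCompleted += OnDocumentCompleted;
    }

    public void Start()
    {
        NavigateNext();
    }

    private void NavigateNext()
    {
        if (_urls.Count > 0)
            _wb.Navigate(_urls.Dequeue());
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can fire once per frame; skip until the whole page is ready.
        if (_wb.ReadyState != WebBrowserReadyState.Complete) return;

        HtmlElementCollection roots = _wb.Document.GetElementsByTagName("HTML");
        if (roots.Count > 0 && PageLoaded != null)
            PageLoaded(_wb.Url.ToString(), roots[0].OuterHtml);

        NavigateNext(); // only now move on to the next URL
    }
}
```

From form code this might be used as `var q = new PageQueue(urls); q.PageLoaded += (url, html) => { /* scrape */ }; q.Start();`. Because nothing blocks, there is no need for `Application.DoEvents()` or `Thread.Sleep`.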
Alex K.
  • Thanks a lot, Alex. This is exactly the answer I was looking for. Could you show me how to add the DocumentCompleted event, please? – Mike Long Feb 12 '16 at 15:16
  • Edited with example. – Alex K. Feb 12 '16 at 15:26
  • Alex, thanks. This is a console application. I used this code, but Wb_DocumentCompleted was never triggered. – Mike Long Feb 12 '16 at 16:07
  • Oh. See the threading comments: [Using WebBrowser in a console application](http://stackoverflow.com/questions/6324810/using-webbrowser-in-a-console-application) – Alex K. Feb 12 '16 at 16:19
  • Alex. Thanks a million. You saved my day. – Mike Long Feb 12 '16 at 16:48
  • I get the error "ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment." Can you help, please? – Arya Apr 28 '23 at 04:38
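
Both console-application problems in the comments above (the event never firing, and the single-threaded apartment error) come down to threading: WebBrowser wraps an ActiveX control, so it needs an STA thread, and its events are delivered through a message loop. The following is a minimal sketch of one way to provide both from a console program, not the method from the linked question; the URL is simply the one used in the answer, and the variable names are illustrative.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class Program
{
    static void Main()
    {
        string html = null;

        var worker = new Thread(() =>
        {
            var wb = new WebBrowser { ScriptErrorsSuppressed = true };
            wb.DocumentCompleted += (s, e) =>
            {
                // Skip per-frame completions; wait for the whole page.
                if (wb.ReadyState != WebBrowserReadyState.Complete) return;

                html = wb.Document.GetElementsByTagName("HTML")[0].OuterHtml;
                Application.ExitThread(); // end the message loop started below
            };
            wb.Navigate("http://stackoverflow.com");
            Application.Run(); // pump messages so DocumentCompleted can fire
        });

        worker.SetApartmentState(ApartmentState.STA); // must be set before Start()
        worker.Start();
        worker.Join();

        Console.WriteLine(html == null ? "no document" : html.Length + " characters");
    }
}
```

Marking `Main` with `[STAThread]` and calling `Application.Run()` on the main thread is an alternative; the separate thread is used here only so that `Main` can wait for the result with `Join()`. The project needs a reference to System.Windows.Forms.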