
First of all, apologies for my lack of technical knowledge and any miscommunication; I'm quite new to C#.

I've taken over a project which scrapes a number of webpages and saves them as .png files.

private void CaptureWebPage(string URL, string filePath, ImageFormat format)
{
    System.Windows.Forms.WebBrowser web = new System.Windows.Forms.WebBrowser();
    web.ScrollBarsEnabled = false; 
    web.ScriptErrorsSuppressed = true; 
    web.Navigate(URL); 

    while (web.ReadyState != System.Windows.Forms.WebBrowserReadyState.Complete)
        System.Windows.Forms.Application.DoEvents();
    System.Threading.Thread.Sleep(5000);

    int width = web.Document.Body.ScrollRectangle.Width;
    width += width / 10;
    width = width <= 300 ? 600 : width; 


    int height = web.Document.Body.ScrollRectangle.Height;
    height += height / 10;

    web.Width = width;
    web.Height = height;

    _bmp = new System.Drawing.Bitmap(width, height);


    web.DrawToBitmap(_bmp, new System.Drawing.Rectangle(0, 0, width, height));
    _bmp.Save(filePath, format);

    _bmp.Dispose();

}

However, a small number of the pages cause the process to hang. It's not every time, but fairly often. The problem seems to be in the following part of the code:

while (web.ReadyState != System.Windows.Forms.WebBrowserReadyState.Complete)
    System.Windows.Forms.Application.DoEvents();

It looks as though web.ReadyState gets stuck at 'Interactive' and never progresses to 'Complete', so the loop never exits.

Is it possible to add code that restarts the process for that page if web.ReadyState stays at 'Interactive' for a certain amount of time, and if so, what would the syntax be?

Curtis
MarkyWil
  • That is a pretty intensive loop; I would be tempted to put a `100ms` sleep in there and see if that helps – musefan Jul 02 '13 at 08:41
  • Putting the sleep in seems to make no noticeable difference. Is it possible to force the web.ReadyState to go to 'Complete' from 'Interactive', perhaps after a certain amount of time? Not sure what kind of problems this might cause though. – MarkyWil Jul 02 '13 at 10:17
  • Perhaps it would be more reliable to use the [DocumentCompleted Event](http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.documentcompleted.aspx) to detect when the page has loaded. [More info here](http://stackoverflow.com/questions/11763189/webbrowser-document-completed-event-c-sharp), [and here](http://stackoverflow.com/questions/840813/how-to-use-webbrowser-control-documentcompleted-event-in-c) – musefan Jul 02 '13 at 12:36
  • I'm not sure exactly how the page loads, and I've heard that the DocumentCompleted event can fire multiple times, depending on what is being loaded. – MarkyWil Jul 02 '13 at 13:08

2 Answers

6

I've replaced the existing problematic code with the following (found on thebotnet.com):

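// Wait while the control reports it is still busy navigating.
while (web.IsBusy)
    System.Windows.Forms.Application.DoEvents();

// Poll ReadyState at most 500 times, 10 ms apart (roughly 5 seconds), then stop waiting.
for (int i = 0; i < 500; i++)
{
    if (web.ReadyState == System.Windows.Forms.WebBrowserReadyState.Complete)
        break;

    System.Windows.Forms.Application.DoEvents();
    System.Threading.Thread.Sleep(10);
}

// Process any messages still queued before continuing.
System.Windows.Forms.Application.DoEvents();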

I've tested it a few times and all pages seem to be scraped fine. I'll continue testing it just in case, but if you have any information on issues it could cause please let me know, as I may not find them myself.
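One thing to be aware of: the `for` loop gives up after roughly 500 × 10 ms = 5 seconds whether or not the page reached 'Complete', so a stuck page is still captured in whatever state it is in. If you would rather detect the timeout and reload the page, as the question asks, the wait can be pulled into a helper that reports success or failure. The following is only a rough sketch: `BoundedWait`, `WaitUntil` and `Retry` are names I made up for illustration (they are not part of WinForms), and the timeout and attempt count in the usage comment are arbitrary.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper, not a framework class.
public static class BoundedWait
{
    // Polls isDone until it returns true or the timeout elapses.
    // pump runs on every pass (in WinForms this would be Application.DoEvents).
    // Returns true if isDone succeeded, false if the wait timed out.
    public static bool WaitUntil(Func<bool> isDone, TimeSpan timeout, Action pump, int pollMs = 10)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (!isDone())
        {
            if (clock.Elapsed >= timeout)
                return false;

            pump();
            Thread.Sleep(pollMs);
        }
        return true;
    }

    // Runs attempt up to maxAttempts times and stops at the first success.
    public static bool Retry(Func<bool> attempt, int maxAttempts)
    {
        for (int i = 0; i < maxAttempts; i++)
        {
            if (attempt())
                return true;
        }
        return false;
    }
}

// Assumed usage inside CaptureWebPage (5 seconds and 3 attempts are arbitrary):
//
// bool loaded = BoundedWait.Retry(() =>
// {
//     web.Navigate(URL);
//     return BoundedWait.WaitUntil(
//         () => web.ReadyState == System.Windows.Forms.WebBrowserReadyState.Complete,
//         TimeSpan.FromSeconds(5),
//         System.Windows.Forms.Application.DoEvents);
// }, 3);
```

If `loaded` comes back false you can log the URL and skip it, or capture the page anyway as the loop above effectively does. I have not tested this against the pages that hang, so treat it as a starting point.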

MarkyWil
0

The same approach in VB.NET:

    While WebBrowser1.IsBusy
        System.Windows.Forms.Application.DoEvents()
    End While
    For i As Integer = 0 To 499
        If WebBrowser1.ReadyState <> System.Windows.Forms.WebBrowserReadyState.Complete Then
            System.Windows.Forms.Application.DoEvents()
            System.Threading.Thread.Sleep(10)
        Else
            Exit For
        End If
    Next
    System.Windows.Forms.Application.DoEvents()