I have a Windows desktop application that scrapes a website using the WebBrowser control.

I had to use WebBrowser because the website relies on JavaScript, so that was the only way to get the HTML content of the pages.

The program has to parse about 1500 pages, so I added a task delay to avoid overloading the server (and possibly getting banned).

The problem is that after 50-100 parsed pages, I get an out-of-memory error and the program closes.

This is the code:

private async void buttonProd_Click(object sender, EventArgs e)
{
    const string C_Prod_UrlTemplate = "http://www.mysite.it";

    var _searches = new List<Get_SiteSearchResult>();
    using (ProdDataContext db = new ProdDataContext())
    {
        _searches = db.Get_SiteSearch("PROD").ToList();
        foreach (var s in _searches)
        {
            WebBrowser wb1 = new WebBrowser();
            wb1.ScriptErrorsSuppressed = true;

            Uri uri = new Uri(String.Format(C_Prod_UrlTemplate,s.prod));

            wb1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser_DocumentCompleted);                    

            wb1.Url = uri;
            await Task.Delay(90 * 1000);
        }
    }
}

private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    using (ProdDataContext db = new ProdDataContext())
    {
        WebBrowser wb = (WebBrowser)sender;

        string s = wb.Document.Body.InnerHtml;

        string fName = wb.CodSite + "_" + wb.PostId + ".txt";

        File.WriteAllText(wb.FolderPath + @"LINKS\" + fName, s);

        db.Set_LinkDownloaded(wb.CodSite, wb.PostId);        
    }
}

The error is raised on this line in the webBrowser_DocumentCompleted method:

string s = wb.Document.Body.InnerHtml;

Thanks for your support.

DarioN1
  • Once you have used the contents of the page you might like to call Dispose to free up the browser (and maybe the memory) – Rob Dec 16 '17 at 11:48
  • It looks like you are creating a `WebBrowser` control for each fetch. Not sure if this is the cause, but it seems too heavy for such a simple task (fetch URL content). You can use `WebClient` as indicated [here](https://stackoverflow.com/questions/1048199/easiest-way-to-read-from-a-url-into-a-string-in-net) or a library that also allows for powerful processing of the content like [HtmlAgilityPack](https://stackoverflow.com/questions/1048199/easiest-way-to-read-from-a-url-into-a-string-in-net). – Alexei - check Codidact Dec 16 '17 at 11:49
  • I'll try disposing the WebBrowser after I get the content. @Alexei, I cannot use WebClient because the final content I need is only available after some JavaScript redirects... – DarioN1 Dec 16 '17 at 12:01
  • Where should I dispose the WebBrowser, in your opinion? It should already be disposed after each loop iteration... – DarioN1 Dec 16 '17 at 12:12
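
As a rough illustration of the Dispose suggestion in the comments above, here is a minimal sketch that reuses one WebBrowser for every page and disposes it when the loop ends, instead of creating a new control per URL. The names `SingleBrowserScraper`, `ScrapeAsync`, `waitPerPage` and `onHtml` are hypothetical and not from the original code. It assumes the method is called from the UI thread (for example from the button click handler), and it keeps the fixed delay from the question as the way to wait for JavaScript redirects. Whether this removes the out-of-memory error is not confirmed anywhere in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Windows.Forms;

public static class SingleBrowserScraper
{
    // Hypothetical helper: visits each URI with the same WebBrowser instance.
    // Must be awaited from the UI (STA) thread, because WebBrowser is a WinForms control.
    public static async Task ScrapeAsync(
        IEnumerable<Uri> uris, TimeSpan waitPerPage, Action<Uri, string> onHtml)
    {
        // 'using' calls Dispose on the control once all pages have been visited.
        using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
        {
            foreach (Uri uri in uris)
            {
                browser.Navigate(uri);

                // Same idea as the delay in the question: give the page time to load
                // and finish its redirects, and avoid hammering the server.
                await Task.Delay(waitPerPage);

                // Document or Body can be null if the page failed to load.
                HtmlDocument doc = browser.Document;
                string html = (doc != null && doc.Body != null) ? doc.Body.InnerHtml : null;
                if (html != null)
                {
                    onHtml(uri, html);
                }
            }
        }
    }
}
```

A click handler could then call `await SingleBrowserScraper.ScrapeAsync(uris, TimeSpan.FromSeconds(90), (uri, html) => { /* save html */ });`, where the callback does the file write and database update.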

1 Answer

Instead of using a control (which is a rather complex construct that requires more memory than a simple object), you can simply fetch the string (the HTML code only) associated with a URL like this:

using(WebClient wc = new WebClient()) {
   string s = wc.DownloadString(url);
   // do stuff with content
}

Of course, you should add some error handling (maybe even a retry mechanism) and some delays so that you do not send too many requests in a given time interval.
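
The error handling and retry idea could look roughly like the sketch below. The names `PoliteDownloader`, `FetchWithRetry` and `Download` are hypothetical, and the number of attempts and the delay are arbitrary choices, not values from the question. The fetch step is passed in as a delegate so the retry logic does not depend on how the page is actually downloaded.

```csharp
using System;
using System.Net;
using System.Threading;

public static class PoliteDownloader
{
    // Calls fetch(url) up to maxAttempts times, sleeping for 'delay' after each
    // failed attempt. The last failure is rethrown to the caller.
    public static string FetchWithRetry(
        Func<string, string> fetch, string url, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return fetch(url);
            }
            catch (WebException) when (attempt < maxAttempts)
            {
                // Transient network error: wait, then try again.
                Thread.Sleep(delay);
            }
        }
    }

    // Plain WebClient download, as in the snippet above.
    public static string Download(string url)
    {
        using (var wc = new WebClient())
        {
            return wc.DownloadString(url);
        }
    }
}
```

It could be used as `string s = PoliteDownloader.FetchWithRetry(PoliteDownloader.Download, url, 3, TimeSpan.FromSeconds(5));`. Note that `Thread.Sleep` blocks the calling thread, so in a desktop application this would normally run off the UI thread.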

Alexei - check Codidact
  • I cannot use WebClient because the pages are often redirected by JavaScript to other pages, so WebBrowser is the only way to get what I need – DarioN1 Dec 16 '17 at 11:59
  • @DarioN1 - oh, it can be compensated for, as indicated by [this question and its answer](https://stackoverflow.com/questions/13039068/webclient-does-not-automatically-redirect). – Alexei - check Codidact Dec 16 '17 at 12:20
  • @DarioN1 - also, if you try HtmlAgilityPack, check [this question](https://stackoverflow.com/questions/6239319/c-sharp-htmlagility-pack-capturing-redirct). – Alexei - check Codidact Dec 16 '17 at 12:22