I am using a WebBrowser control to scrape pages on Yahoo News. I need to use a WebBrowser rather than HtmlAgilityPack to accommodate JavaScript and the like.

Application Type: WinForm
.NET Framework: 4.5.1
VS: 2013 Ultimate
OS: Windows 7 Professional 64-bit

I am able to scrape the required text, but I am unable to return control of the application to the calling function or any other function when scraping is complete. I also cannot verify that scraping is complete.

I need to
1. Verify that all page loads and scraping have completed.
2. Perform actions on a list of the results, such as alphabetizing them.
3. Do something with the data, such as displaying the text contents in a TextBox or writing them to SQL.
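For steps 2 and 3, a minimal sketch of the post-processing, assuming scraping has already finished. The `Article` class is a stand-in whose field names mirror the ones used later in this question, and `ArticleReport` is a hypothetical helper, not existing code. Note that `"MM"` is the month specifier in .NET date formats, while `"mm"` means minutes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Stand-in for the real article type (assumption: public fields with these names).
public class Article
{
    public string strHeadline;
    public string strByline;
    public string strContent;
    public DateTime dtDate;
}

// Hypothetical helper for sorting and formatting the scraped results.
public static class ArticleReport
{
    // Returns a new list ordered alphabetically by headline, ignoring case.
    public static List<Article> SortByHeadline(IEnumerable<Article> articles)
    {
        return articles
            .OrderBy(a => a.strHeadline, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    // Builds the display text in one pass instead of appending to a TextBox repeatedly.
    public static string Format(IEnumerable<Article> articles)
    {
        var sb = new StringBuilder();
        foreach (Article a in articles)
        {
            sb.AppendLine(a.strHeadline);
            sb.AppendLine(a.strByline);
            sb.AppendLine(a.strContent);
            sb.AppendLine(a.dtDate.ToString("yyyyMMdd", CultureInfo.InvariantCulture));
            sb.AppendLine();
        }
        return sb.ToString();
    }
}
```

If the real article list has a compatible shape, the display step could then be a single assignment such as `textBox1.Text = ArticleReport.Format(ArticleReport.SortByHeadline(articles));`.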

I declare class-level variables for the WebBrowser, a list of URLs, and an object with a property that contains a list of news articles.

public partial class Form1 : Form
{
   public WebBrowser w = new WebBrowser();    //WebBrowser
   public List<String> lststrURLs = new List<string>();  //URLs
   public ProcessYahooNews pyn = new ProcessYahooNews();  //Contains articles
...
   lststrURLs.Add("http://news.yahoo.com/sample01");
   lststrURLs.Add("http://news.yahoo.com/sample02");
   lststrURLs.Add("http://news.yahoo.com/sample03");

Pressing a button runs the following code. The button's click handler is the calling function.

w.Navigate(strBaseURL + lststrTickers[0]); //invokes w_Loaded

foreach (YahooNewArticle article in pyn.articles)
{
    textBox1.Text += article.strHeadline + "\r\n";
    textBox1.Text += article.strByline + "\r\n";
    textBox1.Text += article.strContent + "\r\n";
    textBox1.Text += article.dtDate.ToString("yyyyMMdd") + "\r\n\r\n"; // "MM" = month; "mm" would be minutes
}

The first problem I have is that program control appears to skip over w.Navigate and pass directly to the foreach block, which does nothing since articles has not been populated yet. Only then is w.Navigate executed.

If I could get the foreach block to wait until after w.Navigate did its work, then many of my problems would be solved. Absent that, w.Navigate will do its work, but then I need control passed back to the calling function.
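One way the wait described above might be achieved is to wrap the `DocumentCompleted` event in a `Task` and `await` it, since `Navigate` only starts the load and returns immediately. The following is a minimal sketch under stated assumptions: `NavigateAsync` is a hypothetical extension method (it is not part of the `WebBrowser` API), it relies on C# 5 `async`/`await` as available in .NET 4.5, and it must be called on the UI thread. `DocumentCompleted` can fire more than once for pages with frames; this sketch simply completes on the first firing.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

public static class WebBrowserExtensions
{
    // Hypothetical helper: starts navigation and returns a task that completes
    // the first time DocumentCompleted fires afterwards.
    public static Task<WebBrowserDocumentCompletedEventArgs> NavigateAsync(
        this WebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<WebBrowserDocumentCompletedEventArgs>();
        WebBrowserDocumentCompletedEventHandler handler = null;

        handler = (sender, args) =>
        {
            // Unsubscribe so the next navigation does not complete this task again.
            browser.DocumentCompleted -= handler;
            tcs.TrySetResult(args);
        };

        // Subscribe before navigating so the event cannot be missed.
        browser.DocumentCompleted += handler;
        browser.Navigate(url);
        return tcs.Task;
    }
}
```

With a helper like this, the button handler could be declared `async void`, loop over `lststrURLs`, and for each URL `await w.NavigateAsync(url)` before running the scraping calls with the returned event args. The `foreach` over `pyn.articles` would then run only after the loop ends, on the UI thread and without extra threads. A separately subscribed `w_Loaded` handler would need to be removed in that design so each page is not scraped twice.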

I have worked on a partial work-around.

w.Navigate loads a page into the WebBrowser. When it is done loading, the event w.DocumentCompleted fires. I am handling the event with w_Loaded, which uses a class with logic to perform the web scraping.

// Sets up the class
pyn.ProcessYahooNews_Setup(w, e);
// Perform the scraping
pyn.ProcessLoad();

The result of the scraping is that pyn.articles is populated. The next page is loaded only when certain criteria are met, such as pyn.articles.Count > 0.

if (pyn.articles.Count > 0)
{
    //Navigate to the next page
    i++;
    w.Navigate(lststrURLs[i]);
}

More pages are scraped, and articles.Count grows. However, I cannot determine that scraping is done - that there will not be more page loads resulting in more articles.
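If the event-driven design is kept, one way to know that scraping is done is to treat the URL list itself as the source of truth: when no URL remains, no further page load will happen. The sketch below assumes the set of pages is known up front; `ScrapeSequencer` is a hypothetical class, not existing code. It hands out URLs one at a time and raises `Finished` exactly once when the list is exhausted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: tracks the next URL and signals completion exactly once.
public sealed class ScrapeSequencer
{
    private readonly IList<string> _urls;
    private int _index = -1;
    private bool _finished;

    // Raised a single time, when MoveNext is called and no URLs remain.
    public event EventHandler Finished;

    public ScrapeSequencer(IList<string> urls)
    {
        if (urls == null) throw new ArgumentNullException("urls");
        _urls = urls;
    }

    // Returns the next URL to navigate to, or null when the list is exhausted.
    public string MoveNext()
    {
        if (_index + 1 < _urls.Count)
        {
            _index++;
            return _urls[_index];
        }

        if (!_finished)
        {
            _finished = true;
            EventHandler handler = Finished;
            if (handler != null) handler(this, EventArgs.Empty);
        }
        return null;
    }
}
```

In this arrangement the button handler would call `MoveNext()` once to get the first URL, and `w_Loaded` would call it again after each page is scraped, navigating only when the result is not null. The code that sorts and displays the articles would live in the `Finished` handler, so it runs once after the last page rather than once per page load.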

Even once I am confident that the scraping is done, I need to make articles available for further handling, such as sorting the list, removing certain elements, and displaying its textual content in a TextBox.

That takes me back to the foreach block that was called too early. Now I need it, but I have no way to get articles into the foreach. I don't think I can call some other function from w_Loaded to do the handling for me, because it would be called for each page load, and I need to call the function once after all page loads.

It occurs to me that some threaded architecture might help, but I could use help figuring out what that architecture would look like.

Jacob Quisenberry
  • FYI: http://stackoverflow.com/questions/22239357/how-to-cancel-task-await-after-a-timeout-period/22262976#22262976 – noseratio Jun 08 '14 at 22:42
  • @Noseratio, thank you. I thought that maybe an architecture that makes heavy use of threads or tasks might be a solution. The article you pointed to seems to validate that belief. – Jacob Quisenberry Jun 10 '14 at 16:33
  • Not exactly heavy use of threads, though. Rather, extensive use of asynchronous code. In fact, for a WinForms app you don't need any additional threads. For a console app, you need a dedicated STA thread to run the message loop for as many `WebBrowser` instances as you want. Check this: http://stackoverflow.com/questions/23808061/use-threadpool-to-limit-max-number-of-threads-attempted-to-read-or-write-prote/23819021#23819021 – noseratio Jun 12 '14 at 01:25
  • @Noseratio, I have questions about your code, but I will post them under your post at http://stackoverflow.com/questions/22239357/how-to-cancel-task-await-after-a-timeout-period/22262976#22262976 – Jacob Quisenberry Jun 14 '14 at 04:19
  • Answered [here](http://stackoverflow.com/a/24225505/1768303). – noseratio Jun 15 '14 at 01:16
