I am using a WebBrowser control to scrape pages on Yahoo News. I need to use a WebBrowser rather than HtmlAgilityPack because the pages depend on JavaScript and the like.
Application Type: WinForm
.NET Framework: 4.5.1
VS: 2013 Ultimate
OS: Windows 7 Professional 64-bit
I am able to scrape the required text, but I am unable to return control of the application to the calling function or any other function when scraping is complete. I also cannot verify that scraping is complete.
I need to
1. Verify that all page loads and scraping have completed.
2. Perform actions on a list of the results, such as alphabetizing them.
3. Do something with the data, such as displaying text contents in a Text box or writing them to SQL.
I declare class-level variables for the WebBrowser, a list of URLs, and an object with a property that contains a list of news articles.
public partial class Form1 : Form
{
public WebBrowser w = new WebBrowser(); //WebBrowser
public List<String> lststrURLs = new List<string>(); //URLs
public ProcessYahooNews pyn = new ProcessYahooNews(); //Contains articles
...
lststrURLs.Add("http://news.yahoo.com/sample01");
lststrURLs.Add("http://news.yahoo.com/sample02");
lststrURLs.Add("http://news.yahoo.com/sample03");
Pressing a button, whose handler is the calling function, runs this code.
w.Navigate(lststrURLs[0]); //invokes w_Loaded once the page has loaded
foreach (YahooNewArticle article in pyn.articles)
{
textBox1.Text += article.strHeadline + "\r\n";
textBox1.Text += article.strByline + "\r\n";
textBox1.Text += article.strContent + "\r\n";
textBox1.Text += article.dtDate.ToString("yyyyMMdd") + "\r\n\r\n"; //MM is month; mm would be minutes
}
The first problem I have is that program control appears to skip over w.Navigate and pass directly to the foreach block, which does nothing since articles has not been populated yet. Only then is w.Navigate executed.
If I could get the foreach block to wait until after w.Navigate did its work, then many of my problems would be solved. Absent that, w.Navigate will do its work, but then I need control passed back to the calling function.
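As far as I understand, Navigate only starts the page load and returns immediately, which would explain why the foreach runs first. A minimal sketch of one way to make the calling code wait, assuming async/await (available in .NET 4.5.1) is acceptable: wrap DocumentCompleted in a Task. NavigateAsync and WebBrowserNavigation are hypothetical names, not framework members, and the ReadyState check is an assumption about how to skip the extra DocumentCompleted events that pages with frames raise.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

// Hypothetical helper; NavigateAsync is NOT part of the .NET Framework.
public static class WebBrowserNavigation
{
    // Starts navigation and returns a Task that completes when the
    // document has finished loading.
    public static Task<WebBrowserDocumentCompletedEventArgs> NavigateAsync(
        this WebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<WebBrowserDocumentCompletedEventArgs>();
        WebBrowserDocumentCompletedEventHandler handler = null;

        handler = (sender, e) =>
        {
            // Assumption: DocumentCompleted can fire once per frame, so wait
            // until the control reports the whole document as complete.
            if (browser.ReadyState != WebBrowserReadyState.Complete) return;

            browser.DocumentCompleted -= handler; // one-shot subscription
            tcs.TrySetResult(e);
        };

        browser.DocumentCompleted += handler;
        browser.Navigate(url); // returns immediately; the load continues in the background
        return tcs.Task;
    }
}

// Possible usage inside Form1 (the handler name button1_Click is assumed,
// and w_Loaded would no longer be subscribed in this design):
//
// private async void button1_Click(object sender, EventArgs e)
// {
//     foreach (string url in lststrURLs)
//     {
//         WebBrowserDocumentCompletedEventArgs args = await w.NavigateAsync(url);
//         pyn.ProcessYahooNews_Setup(w, args);
//         pyn.ProcessLoad();
//     }
//     // Every page has been loaded and scraped here, so pyn.articles
//     // can be sorted, filtered, and written to textBox1.
// }
```

With this shape the continuation after each await resumes on the UI thread, so no extra threads are needed and the foreach over pyn.articles can simply follow the loop.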
I have worked on a partial work-around.
w.Navigate loads a page into the WebBrowser. When it is done loading, the event w.DocumentCompleted fires. I am handling the event with w_Loaded, which uses a class with logic to perform the web scraping.
// Sets up the class
pyn.ProcessYahooNews_Setup(w, e);
// Perform the scraping
pyn.ProcessLoad();
The result of the scraping is that pyn.articles is populated. The next page is loaded only when some criterion is met, such as pyn.articles.Count > 0.
if (pyn.articles.Count > 0 && i + 1 < lststrURLs.Count)
{
//Navigate to the next page, if there is one
i++;
w.Navigate(lststrURLs[i]);
}
More pages are scraped, and articles.Count grows. However, I cannot determine that scraping is done - that there will not be more page loads resulting in more articles.
Supposing I am confident that the scraping is done, I need to make articles available for further handling, such as sorting it as a list, removing certain elements, and displaying its textual content in a TextBox.
That takes me back to the foreach block that was called too early. Now I need it, but I have no way to get articles into the foreach. I don't think I can call some other function from w_Loaded to do the handling for me, because it would be called for each page load, and I need to call the function once, after all page loads.
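One idea for calling a function exactly once, sketched under the assumption that the list of URLs is fixed before scraping starts: keep a position in the list, advance it on every completed load, and raise a single event when the position runs past the end. PageSequencer, AllPagesCompleted, and ShowArticles are hypothetical names introduced for illustration. The class takes the navigation step as a delegate so the counting logic does not depend on the WebBrowser itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that walks a fixed list of URLs one at a time and
// signals once when the last page has completed.
public sealed class PageSequencer
{
    private readonly IList<string> urls;
    private readonly Action<string> navigate;
    private int index = -1;
    private bool finished;

    // Raised exactly once, after the last URL has completed.
    public event EventHandler AllPagesCompleted;

    public PageSequencer(IList<string> urls, Action<string> navigate)
    {
        if (urls == null) throw new ArgumentNullException("urls");
        if (navigate == null) throw new ArgumentNullException("navigate");
        this.urls = urls;
        this.navigate = navigate;
    }

    // Begins with the first URL (or finishes immediately if the list is empty).
    public void Start() { MoveNext(); }

    // Call this from the DocumentCompleted handler after scraping the page.
    public void OnDocumentCompleted() { MoveNext(); }

    private void MoveNext()
    {
        if (finished) return; // ignore stray completions after the end

        index++;
        if (index < urls.Count)
        {
            navigate(urls[index]);
            return;
        }

        finished = true;
        EventHandler handler = AllPagesCompleted;
        if (handler != null) handler(this, EventArgs.Empty);
    }
}

// Possible wiring inside Form1 (ShowArticles is a hypothetical method that
// would contain the foreach over pyn.articles):
//
// seq = new PageSequencer(lststrURLs, url => w.Navigate(url));
// seq.AllPagesCompleted += (s, e) => ShowArticles();
// seq.Start();
//
// and at the end of w_Loaded, after pyn.ProcessLoad():
//
// seq.OnDocumentCompleted();
```

In this sketch w_Loaded still runs for every page, but the handling code hangs off AllPagesCompleted, which fires only after the final page.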
It occurs to me that some threaded architecture might help, but I could use some help on figuring out what the architecture would look like.