
I have many URLs (about 800) to download from the web. I have a class, HttpDownloader.cs, that uses the HttpWebRequest class to download and get the HTML pages. After that I parse the pages with Regex.

I want to use the BackgroundWorker component, but I don't know how to do it for all the pages, e.g. in a loop or something like that.

My code:

I tried using the ThreadPool, and it really caused problems. I tried with 4 URLs and it didn't work.

foreach (string link in MyListOfUrls)
{
    string url = link; // copy the loop variable so each work item captures its own value (pre-C# 5 foreach)
    ThreadPool.QueueUserWorkItem((o) =>
    {
        HttpDownloader httpDownload = new HttpDownloader(url);
        string htmlDoc = httpDownload.GetPage(); // get the HTML of the page
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlDoc); // load the HTML string into a document for parsing
        DoPharsing(); // my func for parsing
        Save(); // save into the database
    });
}

Because I use a database connection and a DataTable in my function, I get an exception when I use the ThreadPool:

"Function evaluation disabled because a previous function evaluation timed out. You must continue execution to reenable function evaluation."

So I can't get the data from the DataTable. Maybe I need to download everything first, and do the parsing and saving afterwards?

How can I change it to be asynchronous using the BackgroundWorker component?

P.S. Don't advise me to use the Async CTP, because I didn't manage to download it.

Thanks

Chani Poz
  • Do you want to perform multiple downloads at the same time, or simply separate the download from the GUI (make it asynchronous)? (BTW it's parsing, not pharsing) – digEmAll May 01 '12 at 08:47
  • @digEmAll, I want to perform multiple downloads at the same time, to download **all** the pages more quickly. – Chani Poz May 01 '12 at 08:51
  • What have you tried? There are numerous tutorials on the Internet for the background worker class. How far did you get with any of those tutorials and what specifically are you getting stuck on? Please post your code attempt at using BackgroundWorker. – Merlyn Morgan-Graham May 01 '12 at 08:55
  • Here's a tutorial I wrote for a question a while back on how to use BackgroundWorker. http://stackoverflow.com/a/6578532/232593 – Merlyn Morgan-Graham May 01 '12 at 09:00
  • @Merlyn Morgan-Graham, I tried a thread without the BackgroundWorker class, but it didn't work at all. – Chani Poz May 01 '12 at 09:07
  • @Chanipoz: Show us that broken code and we may be able to help you figure out where you went wrong. Having someone write the code for you will help you less. – Merlyn Morgan-Graham May 01 '12 at 09:09
  • @Merlyn Morgan-Graham, thanks, I looked at your link, but it helps to download just **one** page, not more. – Chani Poz May 01 '12 at 09:10

2 Answers


It depends on what you want to split off: the whole loop, or just the download part of it. Obviously, if you want the whole loop to be in the background, then the easiest way is just to use the ThreadPool.

Note, you will likely have to change your parsing and save functions so you are passing the HTML document in to each function.

ThreadPool.QueueUserWorkItem((o) =>
{
    foreach (string link in MyListOfUrls)
    {
        HttpDownloader httpDownload = new HttpDownloader(link);
        string htmlDoc = httpDownload.GetPage(); // get the HTML of the page
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlDoc); // load the HTML string into a document for parsing
        var result = DoPharsing(doc); // my func for parsing
        Save(result); // save into the database
    }
});

or

BackgroundWorker worker = new BackgroundWorker();
worker.DoWork += (o, e) =>
{
    foreach (string link in MyListOfUrls)
    {
        HttpDownloader httpDownload = new HttpDownloader(link);
        string htmlDoc = httpDownload.GetPage(); // get the HTML of the page
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlDoc); // load the HTML string into a document for parsing
        var result = DoPharsing(doc); // my func for parsing
        Save(result); // save into the database
    }
};
worker.RunWorkerCompleted += (o, e) =>
{
    // Job completed
};
worker.RunWorkerAsync();

To download multiple links at the same time, simply switch where you create the thread:

foreach (string link in MyListOfUrls)
{
    string url = link; // copy the loop variable so each work item captures its own value (pre-C# 5 foreach)
    ThreadPool.QueueUserWorkItem((o) =>
    {
        HttpDownloader httpDownload = new HttpDownloader(url);
        string htmlDoc = httpDownload.GetPage(); // get the HTML of the page
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlDoc); // load the HTML string into a document for parsing
        var result = DoPharsing(doc); // my func for parsing
        Save(result); // save into the database
    });
}

(Better to use the thread pool here than creating hundreds of background workers, I think.)
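If you do queue one work item per link, note that nothing above caps how many downloads run at once or tells you when all 800 have finished. Below is a minimal sketch of one way to handle both, assuming .NET 4: a `SemaphoreSlim` caps concurrency and a `CountdownEvent` signals completion. `DownloadParseAndSave` is a hypothetical placeholder for the download/parse/save body from the loop above:

using System;
using System.Threading;

class ThrottledQueueSketch
{
    static void Main()
    {
        string[] myListOfUrls = { "http://example.com/1", "http://example.com/2" }; // stand-in list
        var throttle = new SemaphoreSlim(10);  // at most 10 downloads in flight at once
        var done = new CountdownEvent(1);      // starts at 1 so Wait() can't reach zero mid-loop

        foreach (string link in myListOfUrls)
        {
            string url = link;                 // per-item copy for the closure
            throttle.Wait();                   // the queueing thread pauses while 10 items are running
            done.AddCount();
            ThreadPool.QueueUserWorkItem(o =>
            {
                try { DownloadParseAndSave(url); }             // hypothetical: one page, end to end
                finally { throttle.Release(); done.Signal(); }
            });
        }

        done.Signal();                         // remove the initial count
        done.Wait();                           // blocks until every queued item has signalled
        Console.WriteLine("All downloads finished");
    }

    static void DownloadParseAndSave(string url) { /* download, parse and save one page here */ }
}

Waiting on the semaphore before queueing keeps the pool from filling up with hundreds of blocked threads; only about ten work items exist at any moment.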

samjudson
  • You could also do `Parallel.ForEach`. The problem with the implementation here is that it doesn't support clean cancellation. You'll have to wait until the threads complete to cleanly kill your downloads - either when the downloads have completed, or the connections have timed out. The only solution to this would be to use a download mechanism that was non-blocking, in which case you'd have no need to queue threads on your own. – Merlyn Morgan-Graham May 01 '12 at 09:05
  • What is the difference between the two examples above? Do they download asynchronously, i.e. download the pages at the same time? – Chani Poz May 01 '12 at 09:05
  • @Chanipoz: There are three examples here. The first two are nearly identical, and are only helpful because they'd keep your GUI from hanging. They don't download multiple documents at the same time. The third would download multiple documents at the same time, but could queue up a lot of thread pool work items. I'm guessing since it "queues them up" that this isn't a problem, and it will internally only run so many tasks at the same time. – Merlyn Morgan-Graham May 01 '12 at 09:08
  • The only difference really between the first two examples is that in the second you get the completed event to let your UI know that you have finished. – samjudson May 01 '12 at 09:16
  • I'm not that happy with your ThreadPool solution... we must assume that `GetPage` runs synchronously and therefore will block for a significant amount of time. This being the case, it is not a suitable workload for the ThreadPool. The ThreadPool will starve because it is designed to keep thread count to a minimum and therefore exhibits considerable latency spinning up new threads in response to a large work queue. In turn, this will affect other non-related APIs, such as Threading.Timer, which executes its callback in the ThreadPool. Now our Timer isn't working properly! – spender May 01 '12 at 09:29
  • In such a situation, increasing the ThreadPool threads is rarely the right option. It's better to rewrite your workload to take advantage of asynchronous IO, as sketched below. This means that ThreadPool tasks are short lived, and starvation no longer occurs. – spender May 01 '12 at 09:32
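To illustrate the asynchronous IO spender recommends (which needs no Async CTP), here is a minimal sketch using the `HttpWebRequest.BeginGetResponse`/`EndGetResponse` pair; the URL list is a placeholder:

using System;
using System.IO;
using System.Net;
using System.Threading;

class AsyncIoSketch
{
    static CountdownEvent remaining;

    static void Main()
    {
        string[] urls = { "http://example.com/1", "http://example.com/2" }; // placeholder list
        remaining = new CountdownEvent(urls.Length);

        foreach (string url in urls)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            // BeginGetResponse starts non-blocking IO: no thread sits waiting on the server,
            // and the ThreadPool only runs the short callback once data arrives.
            request.BeginGetResponse(OnResponse, request);
        }

        remaining.Wait(); // keep the console process alive until every request completes
    }

    static void OnResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        try
        {
            using (WebResponse response = request.EndGetResponse(ar))
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd(); // hand this off to the parse/save step
            }
        }
        catch (WebException) { /* record the failed URL */ }
        finally { remaining.Signal(); }
    }
}

One real-world caveat: `ServicePointManager.DefaultConnectionLimit` caps concurrent connections per host (2 by default in client apps), so it usually needs raising when hitting the same server hundreds of times.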

I eventually found my answer. Here is my code:

static BackgroundWorker[] d = new BackgroundWorker[MyListOfUrls.Length];
static string[] html = new string[MyListOfUrls.Length];

static void Main(string[] args)
{
    for (int i = 0; i < MyListOfUrls.Length; i++)
    {
        d[i] = new BackgroundWorker { WorkerReportsProgress = true };
        d[i].DoWork += new DoWorkEventHandler(worker2_DoWork);
        d[i].ProgressChanged += new ProgressChangedEventHandler(Program_ProgressChanged);
        d[i].RunWorkerCompleted += new RunWorkerCompletedEventHandler(RunWorkerCompleted); // subscribe before starting
        d[i].RunWorkerAsync(i); // pass the index so the worker knows which URL to fetch
        Thread.Sleep(1000);
    }
}

static void RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
    Console.WriteLine("End");
}

static void Program_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    Console.WriteLine(e.ProgressPercentage.ToString());
}

static void worker2_DoWork(object sender, DoWorkEventArgs e)
{
    var worker = (BackgroundWorker)sender;
    worker.ReportProgress((int)e.Argument);

    HttpDownloader httpDownload = new HttpDownloader(MyListOfUrls[(int)e.Argument]);
    html[(int)e.Argument] = httpDownload.GetPage();

    Thread.Sleep(500);
}

If anyone knows how to do it better, I will be happy. Thanks, Chani
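One possible improvement, sketched below under the same constraint of no Async CTP: `WebClient.DownloadStringAsync` downloads with asynchronous IO, so neither an array of BackgroundWorkers nor the `Thread.Sleep` calls are needed, and a `CountdownEvent` reports when every page has arrived. The URL list here is a placeholder:

using System;
using System.Net;
using System.Threading;

class WebClientSketch
{
    static readonly string[] MyListOfUrls = { "http://example.com/1", "http://example.com/2" }; // placeholder
    static readonly string[] html = new string[MyListOfUrls.Length];
    static CountdownEvent done;

    static void Main()
    {
        done = new CountdownEvent(MyListOfUrls.Length);

        for (int i = 0; i < MyListOfUrls.Length; i++)
        {
            int index = i; // per-iteration copy for the event handler closure
            var client = new WebClient();
            client.DownloadStringCompleted += (sender, e) =>
            {
                if (e.Error == null)
                    html[index] = e.Result;   // page arrived; parse and save from here
                ((WebClient)sender).Dispose();
                done.Signal();
            };
            client.DownloadStringAsync(new Uri(MyListOfUrls[index]));
        }

        done.Wait(); // replaces the Thread.Sleep calls: returns once every download has finished
        Console.WriteLine("End");
    }
}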

Chani Poz