4

I want to create a program that crawls my websites and checks them for HTTP errors and other problems. I want to do this with multiple threads, each of which accepts parameters such as the URL to crawl. I want X threads to be active at any one time while Y tasks are already queued, waiting to be executed.

What is the best strategy for this: ThreadPool, Tasks, Threads, or something else?

maddo7
  • 1
    try this http://stackoverflow.com/questions/4277844/multithreading-a-large-number-of-web-requests-in-c-sharp – Niventh Mar 29 '13 at 15:25
  • "best" is pretty hard to define. I suggest you study the "Related" questions (to the right, below), and pick the one you think would fit best in your application. Probably Tasks is the way to go, but that still leaves a lot of room for variation. – Jim Mischel Mar 29 '13 at 15:26

4 Answers

7

Here's an example that shows how to queue up a bunch of tasks but limit the number that run concurrently. It uses a Queue to keep track of tasks that are ready to run and a Dictionary to keep track of tasks that are running. When a task finishes, it invokes a callback method to remove itself from the Dictionary. An async method launches queued tasks as space becomes available.

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

namespace MinimalTaskDemo
{
    class Program
    {
        private static readonly Queue<Task> WaitingTasks = new Queue<Task>();
        private static readonly Dictionary<int, Task> RunningTasks = new Dictionary<int, Task>();
        public static int MaxRunningTasks = 100; // vary this to dynamically throttle launching new tasks 

        static void Main(string[] args)
        {
            var tokenSource = new CancellationTokenSource();
            var token = tokenSource.Token;
            Worker.Done = new Worker.DoneDelegate(WorkerDone);
            for (int i = 0; i < 1000; i++)  // queue some tasks
            {
                // task state (i) will be our key for RunningTasks
                WaitingTasks.Enqueue(new Task(id => new Worker().DoWork((int)id, token), i, token));
            }
            LaunchTasks();
            Console.ReadKey();
            if (RunningTasks.Count > 0)
            {
                lock (WaitingTasks) WaitingTasks.Clear();
                tokenSource.Cancel();
                Console.ReadKey();
            }
        }

        static async void LaunchTasks()
        {
            // keep checking until we're done
            while ((WaitingTasks.Count > 0) || (RunningTasks.Count > 0))
            {
                // launch tasks when there's room
                while ((WaitingTasks.Count > 0) && (RunningTasks.Count < MaxRunningTasks))
                {
                    Task task = WaitingTasks.Dequeue();
                    lock (RunningTasks) RunningTasks.Add((int)task.AsyncState, task);
                    task.Start();
                }
                UpdateConsole();
                await Task.Delay(300); // wait before checking again
            }
            UpdateConsole();    // all done
        }

        static void UpdateConsole()
        {
            Console.Write(string.Format("\rwaiting: {0,3:##0}  running: {1,3:##0} ", WaitingTasks.Count, RunningTasks.Count));
        }

        // callback from finished worker
        static void WorkerDone(int id)
        {
            lock (RunningTasks) RunningTasks.Remove(id);
        }
    }

    internal class Worker
    {
        public delegate void DoneDelegate(int taskId);
        public static DoneDelegate Done { private get; set; }
        private static readonly Random Rnd = new Random();

        public async void DoWork(object id, CancellationToken token)
        {
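            int steps;
            lock (Rnd) steps = Rnd.Next(20);    // Random is not thread-safe, so roll the step count once under a lock
            for (int i = 0; i < steps; i++)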
            {
                if (token.IsCancellationRequested) break;
                await Task.Delay(100);  // simulate work
            }
            Done((int)id);
        }
    }
}
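For comparison, the same "X running, Y waiting" behaviour can also be sketched with a SemaphoreSlim instead of the Queue and Dictionary used above. This is a different technique from the answer's, shown only as a minimal sketch: the names ThrottleDemo, CheckAsync and gate, the example.org URLs, the limit of 4, and the Task.Delay stand-in for real work are all assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottleDemo
{
    // Hypothetical worker: waits for a free slot, does the "work", then frees the slot.
    static async Task CheckAsync(string url, SemaphoreSlim gate)
    {
        await gate.WaitAsync();          // at most N callers get past this point at once
        try
        {
            await Task.Delay(100);       // stand-in for the real HTTP check (assumption)
            Console.WriteLine("checked " + url);
        }
        finally
        {
            gate.Release();              // always free the slot, even if the check throws
        }
    }

    static void Main()
    {
        // 20 made-up URLs; only 4 checks are allowed to run at the same time.
        var urls = Enumerable.Range(0, 20).Select(i => "http://example.org/page" + i);
        using (var gate = new SemaphoreSlim(4))
        {
            Task.WhenAll(urls.Select(u => CheckAsync(u, gate))).GetAwaiter().GetResult();
        }
    }
}
```

All twenty tasks are created up front, but each one parks on gate.WaitAsync() until a slot is free, so the semaphore's initial count plays the role of MaxRunningTasks.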
Ed Power
  • Hi, what's the purpose of the `for` loop with `Rnd`.. ? This one => `for (int i = 0; i < Rnd.Next(20); i++) { if (token.IsCancellationRequested) break; }` – Shiva Apr 10 '17 at 02:28
  • 1
    @Shiva - the loop merely simulates some amount of work being done. – Ed Power Apr 10 '17 at 13:26
  • 1
    Thanks. So if I was doing actual work in the `await...` line of the `DoWork` method, then I would remove this `for` loop. In that case, would I use `if (token.IsCancellationRequested) return;` in place of `if (token.IsCancellationRequested) break;`? – Shiva Apr 10 '17 at 21:40
4

I recommend using (asynchronous) Tasks for downloading the data and then processing (on the thread pool).

Instead of throttling tasks, I recommend you throttle the number of requests per target server. Good news: .NET already does this for you.

This makes your code as simple as:

using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

private static readonly HttpClient client = new HttpClient();
public async Task Crawl(string url)
{
  // download asynchronously, then parse the HTML on the thread pool
  var html = await client.GetStringAsync(url);
  var nextUrls = await Task.Run(() => ProcessHtml(html));
  var nextTasks = nextUrls.Select(nextUrl => Crawl(nextUrl));
  await Task.WhenAll(nextTasks);
}
private IEnumerable<string> ProcessHtml(string html)
{
  // return all urls in the html string.
}

which you can kick off with a simple:

await Crawl("http://example.org/");
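On the per-server throttling mentioned above: as far as I know, the built-in limit comes from ServicePointManager on the classic .NET Framework, while on newer runtimes HttpClient does not cap connections per host unless you ask it to. A minimal sketch of setting the cap explicitly follows. The class name ClientSetup and the values 4 and 30 seconds are arbitrary assumptions, and HttpClientHandler.MaxConnectionsPerServer is only available on sufficiently recent runtimes.

```csharp
using System;
using System.Net.Http;

class ClientSetup
{
    static void Main()
    {
        // Cap concurrent connections to any single host (4 is an arbitrary example value).
        var handler = new HttpClientHandler { MaxConnectionsPerServer = 4 };
        using (var client = new HttpClient(handler))
        {
            client.Timeout = TimeSpan.FromSeconds(30);   // example timeout, also an assumption
            Console.WriteLine("max connections per server: " + handler.MaxConnectionsPerServer);
        }
    }
}
```

With a handler configured like this, requests beyond the cap wait for a free connection to that host, so the Crawl method itself needs no extra throttling logic.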
Stephen Cleary
0

I would recommend going with the ThreadPool. It is easy enough to work with, and it has a few benefits:

"Thread pool will provide benefits for frequent and relatively short operations by:

  • Reusing threads that have already been created instead of creating new ones (an expensive process)
  • Throttling the rate of thread creation when there is a burst of requests for new work items (I believe this is only in .NET 3.5)

If you queue 100 thread pool tasks, it will only use as many threads as have already been created to service these requests (say 10 for example). The thread pool will make frequent checks (I believe every 500ms in 3.5 SP1) and if there are queued tasks, it will make one new thread. If your tasks are quick, then the number of new threads will be small and reusing the 10 or so threads for the short tasks will be faster than creating 100 threads up front.

If your workload consistently has large numbers of thread pool requests coming in, then the thread pool will tune itself to your workload by creating more threads in the pool by the above process, so that there are a larger number of threads available to process requests."

Thread vs ThreadPool
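As a minimal sketch of what queueing work on the thread pool looks like, the snippet below hands each URL to ThreadPool.QueueUserWorkItem and uses a CountdownEvent to wait for all items to finish. The class name ThreadPoolDemo, the example.org URLs, and the Console.WriteLine stand-in for the real check are assumptions made for illustration.

```csharp
using System;
using System.Threading;

class ThreadPoolDemo
{
    static void Main()
    {
        // Made-up URLs standing in for the pages to check.
        string[] urls = { "http://example.org/a", "http://example.org/b", "http://example.org/c" };

        using (var pending = new CountdownEvent(urls.Length))
        {
            foreach (string url in urls)
            {
                // The second argument is passed to the callback as its state parameter.
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        // Stand-in for the real HTTP check (assumption).
                        Console.WriteLine("checking {0} on pool thread {1}",
                            state, Thread.CurrentThread.ManagedThreadId);
                    }
                    finally
                    {
                        pending.Signal();   // count this work item as finished
                    }
                }, url);
            }
            pending.Wait();                 // block until every queued item has signalled
        }
    }
}
```

The pool decides how many threads actually run these items, so the order of the output lines and the thread ids will vary from run to run.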

Community
MeTitus
-1

Well, Task is a good way to go, because it would mean that you don't have to worry about writing a lot of the "plumbing" code.

I would also recommend that you check out Joe Albahari's web site on threading. It's quite a good primer:

http://www.albahari.com/threading/

code4life