
I have this sample code.

List<Dictionary<string,string>> objects = new List<Dictionary<string,string>>();

foreach (string url in urls)
{
    objects.Add(processUrl(url));
}

I need to process each URL: processUrl downloads the page, runs many regexes to extract some information, and returns a "C# JSON-like" object (a dictionary). I want to run this in parallel, and at the end I need a list of all the objects, so I have to wait for all the tasks to finish before continuing. How can I accomplish this? I've seen many examples, but none that save the return value.
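
For context, processUrl is along these lines (a simplified sketch; the real method runs many more regexes):

Dictionary<string, string> processUrl(string url)
{
    // Requires System.Net and System.Text.RegularExpressions.
    var result = new Dictionary<string, string>();
    using (var client = new WebClient())
    {
        string html = client.DownloadString(url);               // download the page
        var match = Regex.Match(html, "<title>(.*?)</title>");  // one of many patterns
        if (match.Success)
            result["title"] = match.Groups[1].Value;
    }
    return result;
}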

Regards

waldecir

3 Answers


Like this?

var results = urls.AsParallel().Select(processUrl).ToList();

With Parallel:

var syncObject = new object();

Parallel.ForEach(
    urls,
    url =>
    {
        var result = processUrl(url);
        lock (syncObject)
            objects.Add(result);
    });

or

var objects = new ConcurrentBag<Dictionary<string,string>>();
Parallel.ForEach(urls, url => objects.Add(processUrl(url)));
var result = objects.ToList();

or with Tasks:

var tasks = urls
    .Select(url => Task.Factory.StartNew(() => processUrl(url)))
    .ToArray();

Task.WaitAll(tasks);
var results = tasks.Select(arg => arg.Result).ToList();
Alex Aza
  • Rather than use a lock in the body of Parallel.ForEach, I'd use the overload that has a `localInit` and `localFinally` and aggregate all of the results in the `localFinally`. That way you aren't locking on each operation, only once per thread. Put an empty list in the localInit, add to the local list without locking in the body, and collect in the finally (see the first sketch after this thread). – vcsjones Jun 12 '11 at 02:22
  • None of these options as-is provide a way to limit the total number of simultaneous tasks. – Rick Sladkey Jun 12 '11 at 05:41
  • @Rick Sladkey - Not sure I understood your comment. All 3 options have a way to limit the number of simultaneous tasks, I didn't show this in the code, as this wasn't asked. – Alex Aza Jun 12 '11 at 05:53
  • I'm just warning anyone to be careful before you start a thousand threads. – Rick Sladkey Jun 12 '11 at 06:02
  • 1
    @Rick Sladkey - actually none of those approaches by default will start too many threads. The decision to make a thread or wait is made by scheduler, and by default it will be based on number of CPU cores and thread waits. On the contrary - if you want to start unlimited number of the threads with this approach, this will be a problem. However, it is solvable too, if you use `Task` approach. – Alex Aza Jun 12 '11 at 06:07
  • Not for I/O bound tasks it won't. The cores will still be idle. – Rick Sladkey Jun 12 '11 at 06:13
  • @Rick Sladkey - agree about IO. In this case all of those 3 approaches have an option to set degree of parallelism. – Alex Aza Jun 12 '11 at 06:16
  • Agreed. And for I/O bound tasks it's wise to use them. :-) – Rick Sladkey Jun 12 '11 at 06:19
  • Wow, so many options, but I think Parallel.ForEach is more "readable". The task is I/O intensive; my processor stays at 8%, but network usage is still very low too. The comments talk about degree of parallelism, so how can I set this in Parallel.ForEach? – waldecir Jun 12 '11 at 15:19
  • @waldecir - create a `ParallelOptions` instance, set its `MaxDegreeOfParallelism` property, and pass it to one of the `Parallel.ForEach` overloads (see the second sketch after this thread). – Alex Aza Jun 12 '11 at 16:49
  • @waldecir - as I understood from your last comment, you don't really need to limit `MaxDegreeOfParallelism`. If you see that CPU and network are idle, try the `Task` approach and supply `TaskCreationOptions.LongRunning` as a parameter to the `StartNew` method. – Alex Aza Jun 12 '11 at 17:05
  • @Alex, can you explain this part: `() => processUrl(url)`? I don't get it. – waldecir Jun 12 '11 at 17:50
  • @waldecir - it's an anonymous method expressed as a lambda expression. It's a big topic that is difficult to explain briefly, so you might want to read up on it; the last sketch after this thread shows the long-hand equivalent. – Alex Aza Jun 12 '11 at 17:56
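
To illustrate vcsjones's comment above, a sketch of the `localInit`/`localFinally` overload of `Parallel.ForEach` (using the same `objects` list and `syncObject` as in the answer):

Parallel.ForEach(
    urls,
    // localInit: each worker thread starts with its own empty list
    () => new List<Dictionary<string, string>>(),
    // body: add to the thread-local list; no locking needed here
    (url, loopState, local) =>
    {
        local.Add(processUrl(url));
        return local;
    },
    // localFinally: merge each thread's list into the shared one, locking once per thread
    local =>
    {
        lock (syncObject)
            objects.AddRange(local);
    });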
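
And a sketch of the two tuning options from the comments: limiting the degree of parallelism with `ParallelOptions` (the value 16 is just an illustration), and hinting the scheduler with `TaskCreationOptions.LongRunning` for I/O-bound work:

var options = new ParallelOptions { MaxDegreeOfParallelism = 16 };
Parallel.ForEach(urls, options, url =>
{
    var result = processUrl(url);
    lock (syncObject)
        objects.Add(result);
});

// Task variant: LongRunning hints that the work will block on I/O,
// so the scheduler can use a dedicated thread instead of a pool thread.
var tasks = urls
    .Select(url => Task.Factory.StartNew(() => processUrl(url), TaskCreationOptions.LongRunning))
    .ToArray();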
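
Finally, regarding `() => processUrl(url)`: it is a lambda expression, i.e. an inline, unnamed method. Written the long way with an explicit delegate variable, the `StartNew` call above is equivalent to:

// A delegate that takes no parameters and returns processUrl's result.
Func<Dictionary<string, string>> work = () => processUrl(url);
var task = Task.Factory.StartNew(work);   // same as StartNew(() => processUrl(url))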

First, refactor as

processUrl(url, objects);

and make the task responsible for adding the results to the list.

Then add locking so two parallel tasks don't try to use the results list at exactly the same time.
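
A minimal sketch of that shape (here `extractData` stands in for the original download-and-regex logic, and `resultsLock` is a shared lock object; both names are hypothetical):

static readonly object resultsLock = new object();

static void processUrl(string url, List<Dictionary<string, string>> objects)
{
    Dictionary<string, string> result = extractData(url); // the original download + regex work
    lock (resultsLock)                                    // serialize access to the shared list
        objects.Add(result);
}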


Note: async support in the next version of .NET will make this trivially easy.
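
For example, with the `async`/`await` support planned for C# 5 / .NET 4.5, something like this should work inside an `async` method (a sketch, assuming the `Task.Run` and `Task.WhenAll` APIs from that release):

var tasks = urls.Select(url => Task.Run(() => processUrl(url)));
var results = (await Task.WhenAll(tasks)).ToList();   // List<Dictionary<string, string>>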

Ben Voigt

You can use the Task Parallel Library's `Parallel.ForEach`; this requires .NET 4.0:

System.Threading.Tasks.Parallel.ForEach(urls, url =>
{
    var result = processUrl(url);
    lock (objects)
    {
        objects.Add(result);
    }
});
Waleed
  • `Parallel.For` could be used instead, as long as the items are in a list-like collection (e.g. an array or `List`). The loop index can then be used to write each result into its own slot, which is thread-safe with no locks required (see the sketch below). – bobbymcr Jun 12 '11 at 02:21
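
A sketch of that indexed approach (assuming `urls` is a `List<string>`):

// One output slot per input index, so no locking is needed.
var results = new Dictionary<string, string>[urls.Count];
Parallel.For(0, urls.Count, i => results[i] = processUrl(urls[i]));
var objects = results.ToList();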