
I've been doing some investigation to see how we can create a multithreaded application that runs through a tree.

To find the best way to implement this, I've created a test application that runs through my C:\ drive and opens all directories.

class Program
{
    static void Main(string[] args)
    {
        //var startDirectory = @"C:\The folder\RecursiveFolder";
        var startDirectory = @"C:\";

        var w = Stopwatch.StartNew();

        ThisIsARecursiveFunction(startDirectory);

        Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);

        Console.ReadKey();
    }

    public static void ThisIsARecursiveFunction(String currentDirectory)
    {
        var lastBit = Path.GetFileName(currentDirectory);
        var depth = currentDirectory.Count(t => t == '\\');
        //Console.WriteLine(depth + ": " + currentDirectory);

        try
        {
            var children = Directory.GetDirectories(currentDirectory);

            //Edit this mode to switch which way of parallelization it should use
            int mode = 3;

            switch (mode)
            {
                case 1:
                    foreach (var child in children)
                    {
                        ThisIsARecursiveFunction(child);
                    }
                    break;
                case 2:
                    children.AsParallel().ForAll(t =>
                    {
                        ThisIsARecursiveFunction(t);
                    });
                    break;
                case 3:
                    Parallel.ForEach(children, t =>
                    {
                        ThisIsARecursiveFunction(t);
                    });
                    break;
                default:
                    break;
            }

        }
        catch (Exception)
        {
            //Exception might occur for directories that can't be accessed.
        }
    }
}

What I have encountered however is that when running this in mode 3 (Parallel.ForEach) the code completes in around 2.5 seconds (yes, I have an SSD ;) ). Running the code without parallelization, it completes in around 8 seconds. And running the code in mode 2 (AsParallel().ForAll()) it takes a near-infinite amount of time.

When checking in Process Explorer I also encounter a few strange facts:

Mode1 (No Parallelization):
Cpu:     ~25%
Threads: 3
Time to complete: ~8 seconds

Mode2 (AsParallel().ForAll()):
Cpu:     ~0%
Threads: Increasing by one per second (I find this strange, since each thread seems to be waiting either for the other threads to complete or for some one-second timeout.)
Time to complete: 1 second per node so about 3 days???

Mode3 (Parallel.ForEach()):
Cpu:     100%
Threads: At most 29-30
Time to complete: ~2.5 seconds

What I find especially strange is that Parallel.ForEach seems to ignore any parent threads/tasks that are still running, while AsParallel().ForAll() seems to wait for the previous Task to complete (which won't happen soon, since all parent Tasks are still waiting on their child tasks to complete).

Also what I read on MSDN was: "Prefer ForAll to ForEach When It Is Possible"

Source: http://msdn.microsoft.com/en-us/library/dd997403(v=vs.110).aspx

Does anyone have a clue why this could be?

Edit 1:

As requested by Matthew Watson I've first loaded the tree in memory before looping through it. Now the loading of the tree is done sequentially.

The results however are the same. Unparallelized and Parallel.ForEach now complete the whole tree in about 0.05 seconds while AsParallel().ForAll still only goes around 1 step per second.

Code:

class Program
{
    private static DirWithSubDirs RootDir;

    static void Main(string[] args)
    {
        //var startDirectory = @"C:\The folder\RecursiveFolder";
        var startDirectory = @"C:\";

        Console.WriteLine("Loading file system into memory...");
        RootDir = new DirWithSubDirs(startDirectory);
        Console.WriteLine("Done");


        var w = Stopwatch.StartNew();

        ThisIsARecursiveFunctionInMemory(RootDir);

        Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);

        Console.ReadKey();
    }        

    public static void ThisIsARecursiveFunctionInMemory(DirWithSubDirs currentDirectory)
    {
        var depth = currentDirectory.Path.Count(t => t == '\\');
        Console.WriteLine(depth + ": " + currentDirectory.Path);

        var children = currentDirectory.SubDirs;

        //Edit this mode to switch which way of parallelization it should use
        int mode = 2;

        switch (mode)
        {
            case 1:
                foreach (var child in children)
                {
                    ThisIsARecursiveFunctionInMemory(child);
                }
                break;
            case 2:
                children.AsParallel().ForAll(t =>
                {
                    ThisIsARecursiveFunctionInMemory(t);
                });
                break;
            case 3:
                Parallel.ForEach(children, t =>
                {
                    ThisIsARecursiveFunctionInMemory(t);
                });
                break;
            default:
                break;
        }
    }
}

class DirWithSubDirs
{
    public List<DirWithSubDirs> SubDirs = new List<DirWithSubDirs>();
    public String Path { get; private set; }

    public DirWithSubDirs(String path)
    {
        this.Path = path;
        try
        {
            SubDirs = Directory.GetDirectories(path).Select(t => new DirWithSubDirs(t)).ToList();
        }
        catch (Exception)
        {
            //Ignore directories that can't be accessed
        }
    }
}

Edit 2:

After reading the update on Matthew's comment I've tried to add the following code to the program:

ThreadPool.SetMinThreads(4000, 16);
ThreadPool.SetMaxThreads(4000, 16);

This however does not change how AsParallel performs. The first 8 steps are still executed in an instant before slowing down to 1 step per second.

(Extra note: I'm currently ignoring the exceptions that occur when I can't access a directory, via the try/catch block around Directory.GetDirectories().)

Edit 3:

Also, what I'm mainly interested in is the difference between Parallel.ForEach and AsParallel().ForAll, because to me it's just strange that for some reason the second one creates one thread for every recursion it does, while the first one handles everything in around 30 threads max. (And also why MSDN suggests using AsParallel even though it creates so many threads with a ~1 second timeout.)

Edit 4:

Another strange thing I found out: when I try to set the MinThreads on the thread pool above 1023 it seems to ignore the value and falls back to around 8 or 16: ThreadPool.SetMinThreads(1023, 16);

Still, when I use 1023 it does the first 1023 elements very fast, then falls back to the slow pace I've been experiencing all along.

Note: Also, literally more than 1000 threads are now created (compared to 30 for the whole Parallel.ForEach run).

Does this mean Parallel.ForEach is just way smarter in handling tasks?

Some more info: this code prints 8 - 8 twice when you set the value above 1023. (When you set the values to 1023 or lower it prints the correct values.)

        int threadsMin;
        int completionMin;
        ThreadPool.GetMinThreads(out threadsMin, out completionMin);
        Console.WriteLine("Cur min threads: " + threadsMin + " and the other thing: " + completionMin);

        ThreadPool.SetMinThreads(1023, 16);
        ThreadPool.SetMaxThreads(1023, 16);

        ThreadPool.GetMinThreads(out threadsMin, out completionMin);
        Console.WriteLine("Now min threads: " + threadsMin + " and the other thing: " + completionMin);

Edit 5:

At Dean's request I've created another case to manually create tasks:

case 4:
    var taskList = new List<Task>();
    foreach (var todo in children)
    {
        var itemTodo = todo;
        taskList.Add(Task.Run(() => ThisIsARecursiveFunctionInMemory(itemTodo)));
    }
    Task.WaitAll(taskList.ToArray());
    break;

This is also as fast as the Parallel.ForEach() loop. So we still don't have the answer to why AsParallel().ForAll() is so much slower.
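For completeness, here is an untested sketch of the fully awaited variant Dean hints at in the comments: parents await their children instead of blocking in Task.WaitAll, and a SemaphoreSlim caps how many bodies run at once (the limit of 8 is an arbitrary example value; DirWithSubDirs is a minimal copy of the class from Edit 1 so the snippet compiles on its own):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Minimal copy of the DirWithSubDirs class from Edit 1.
class DirWithSubDirs
{
    public List<DirWithSubDirs> SubDirs = new List<DirWithSubDirs>();
    public String Path { get; set; }
}

static class BoundedWalker
{
    // At most 8 bodies run at the same time; parents wait on children with
    // await, so no threadpool thread is blocked while waiting.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(8);

    public static async Task WalkAsync(DirWithSubDirs dir)
    {
        await Gate.WaitAsync();
        try
        {
            // ...process dir.Path here...
        }
        finally
        {
            Gate.Release();
        }

        await Task.WhenAll(dir.SubDirs.Select(WalkAsync));
    }
}

class Demo
{
    static void Main()
    {
        var root = new DirWithSubDirs { Path = "root" };
        root.SubDirs.Add(new DirWithSubDirs { Path = "root\\a" });
        root.SubDirs.Add(new DirWithSubDirs { Path = "root\\b" });

        BoundedWalker.WalkAsync(root).Wait();
        Console.WriteLine("done");
    }
}
```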

Devedse
  • With `ThreadPool.SetMinThreads(4000, 4000);` you are setting the IO completion port threads to an insane number. Try `ThreadPool.SetMinThreads(4000, 16);` instead (same for `SetMaxThreads()`) – Matthew Watson Sep 18 '14 at 12:31
  • I've done that now but somehow still experience the same results. For Mode 2 I see 1 extra thread popping up in Resource Monitor every second. Also, when I enable the Console.WriteLine I see it moving through my disk at about 1 level per second. Mode 1 and 3 still execute my whole disk (80,173 elements) in much less than 1 second (the in-memory one). – Devedse Sep 18 '14 at 12:45
  • Can you use a Task-oriented version – meaning that you await the task of a recursive call, and don’t kick off a new task until the call has finished? The problem here is that you are quickly swamping the number of threads to a large number, when you only have something like 4 CPU cores, when this problem is in-memory and CPU-bound. – Dean Sep 18 '14 at 21:01
  • @Dean, I've created a 4th case to test your solution but also found out that this is just as fast as Parallel.ForEach. I don't think it has much to do with memory operations and CPU because I see almost no CPU activity. I think it's more related to some kind of timeout somewhere in the AsParallel().ForAll() method. – Devedse Sep 19 '14 at 11:04
  • @Devedse, That is still kicking off all the threads at startup, I think. I have not thought this out exactly but I thought you would put a wait of some kind inside the for loop, so that the loop does not continue until the recursion is done. You would get a lot of depth and maybe several threads as it walks down the directory tree, but at least it would not create all the threads in one go. You can also investigate the threads by using Thread.Current.ID, maybe record how many threads are created with each type of model. – Dean Sep 19 '14 at 11:30
  • @Dean, I know it does that but that's the whole problem. Doing it that way doesn't slow it down. It's really fast. Only Mode 2 (with AsParallel().ForAll() slows down to 1 step per second) – Devedse Sep 19 '14 at 12:04
  • @Devedse. Yes, sorry, I was off on a tangent there. My observation with Tasks and with the Parallel.ForEach methods is that they never create more than a handful of threads, typically around 4 or 8 with hyperthreading on a quad-core machine. Then again, I've never dared venture into a recursive investigation! – Dean Sep 19 '14 at 12:17
  • @Dean, yeah that's what I seem to be experiencing as well for Parallel.ForEach and new Tasks (even when using recursion). I hope someone here can explain why and how AsParallel.ForAll differs from them. (And why Microsoft does suggest using AsParallel.ForAll) – Devedse Sep 19 '14 at 14:29
  • @Devedse If you want Parallel.ForEach in a non-CPU-bound task, use MaxDegreeOfParallelism = num_of_cores; it will save resources and be faster. The AsParallel.ForAll documentation warns not to use it with IO-bound tasks. – Pedro.The.Kid Sep 22 '14 at 14:36
  • @Devedse Unrelated to the question but you have a closure problem with case 4 above https://stackoverflow.com/questions/271440/captured-variable-in-a-loop-in-c-sharp – A.Learn Jul 23 '18 at 06:52
  • @A.Learn I'll fix it thanks :) – Devedse Jul 24 '18 at 07:58

4 Answers


This problem is pretty debuggable, an uncommon luxury when you have problems with threads. Your basic tool here is the Debug > Windows > Threads debugger window. It shows you the active threads and gives you a peek at their stack traces. You'll easily see that, once it gets slow, you have dozens of active threads that are all stuck. Their stack traces all look the same:

    mscorlib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout, bool exitContext) + 0x16 bytes  
    mscorlib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout) + 0x7 bytes 
    mscorlib.dll!System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) + 0x182 bytes    
    mscorlib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) + 0x93 bytes   
    mscorlib.dll!System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler scheduler, bool waitForCompletion) + 0xba bytes  
    mscorlib.dll!System.Threading.Tasks.Task.RunSynchronously(System.Threading.Tasks.TaskScheduler scheduler) + 0x13 bytes  
    System.Core.dll!System.Linq.Parallel.SpoolingTask.SpoolForAll<ConsoleApplication1.DirWithSubDirs,int>(System.Linq.Parallel.QueryTaskGroupState groupState, System.Linq.Parallel.PartitionedStream<ConsoleApplication1.DirWithSubDirs,int> partitions, System.Threading.Tasks.TaskScheduler taskScheduler) Line 172  C#
// etc..

Whenever you see something like this, you should immediately think fire-hose problem. Probably the third-most common bug with threads, after races and deadlocks.

Which you can reason out, now that you know the cause: the problem with the code is that every thread that completes adds N more threads, where N is the average number of sub-directories in a directory. In effect, the number of threads grows exponentially, which is always bad. It will only stay under control if N = 1, and that of course never happens on a typical disk.
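The exponential growth is easy to put numbers on. With a hypothetical average branching factor of N = 4, a one-task-per-child scheme already has on the order of 65536 outstanding work items by depth 8:

```csharp
using System;

class FireHoseEstimate
{
    static void Main()
    {
        const int n = 4; // hypothetical average number of sub-directories
        long items = 1;
        for (int depth = 1; depth <= 8; depth++)
        {
            items *= n;
            Console.WriteLine("depth " + depth + ": ~" + items + " outstanding work items");
        }
        // last line printed: depth 8: ~65536 outstanding work items
    }
}
```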

Do beware that, like almost any threading problem, this misbehavior tends to reproduce poorly. The SSD in your machine tends to hide it. So does the RAM: the program might well complete quickly and trouble-free the second time you run it, since you'll now read from the file-system cache instead of the disk, which is very fast. Tinkering with ThreadPool.SetMinThreads() hides it as well, but it cannot fix it. It never fixes any problem, it only hides it: no matter what happens, the exponential number will always overwhelm the set minimum number of threads. You can only hope that it finishes iterating the drive before that happens. An idle hope for a user with a big drive.

The difference between ParallelEnumerable.ForAll() and Parallel.ForEach() is now perhaps also easily explained. You can tell from the stack trace that ForAll() does something naughty: the RunSynchronously() method blocks until all the threads are completed. Blocking is something threadpool threads should not do; it gums up the thread pool and won't allow it to schedule the processor for another job. And it has the effect you observed: the thread pool is quickly overwhelmed with threads that are waiting on the N other threads to complete. Which isn't happening; they are waiting in the pool and not getting scheduled because there are already so many of them active.

This is a deadlock scenario, a pretty common one, but the threadpool manager has a workaround for it. It watches the active threadpool threads and steps in when they don't complete in a timely manner. It then allows an extra thread to start, one more than the minimum set by SetMinThreads(). But not more than the maximum set by SetMaxThreads(); having too many active threadpool threads is risky and likely to trigger OOM. This does solve the deadlock: it gets one of the ForAll() calls to complete. But this happens at a very slow rate; the threadpool only does this twice a second. You'll run out of patience before it catches up.
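The throttled injection rate can be observed directly. This sketch queues more blocking work items than the configured minimum thread count and reports how many actually got started; the exact numbers vary with machine, core count, and runtime version (the blocking-aware pool in recent .NET versions ramps up faster than the classic twice-per-second rate):

```csharp
using System;
using System.Threading;

class InjectionRate
{
    static int started;

    static void Main()
    {
        int minWorker, minIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);

        var gate = new ManualResetEvent(false);
        int queued = minWorker + 20;
        for (int i = 0; i < queued; i++)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                Interlocked.Increment(ref started);
                gate.WaitOne(); // block the pool thread, like ForAll's RunSynchronously does
            });
        }

        Thread.Sleep(3000);
        // The first minWorker items start immediately; after that, extra
        // threads are injected only at a throttled rate, so many items
        // should still be sitting in the queue.
        Console.WriteLine("started " + Volatile.Read(ref started) + " of " + queued + " after 3s");
        gate.Set(); // release everything so the process can exit
    }
}
```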

Parallel.ForEach() doesn't have this problem, it doesn't block so doesn't gum up the pool.

Seems to be the solution, but do keep in mind that your program is still fire-hosing the memory of your machine, adding ever more waiting threadpool threads to the pool. This can crash your program as well; it just isn't as likely, because you have a lot of memory and the threadpool doesn't use a lot of it to keep track of a request. Some programmers, however, manage to accomplish that as well.

The solution is a very simple one: just don't use threading. It is harmful; there is no concurrency when you have only one disk, and it does not like being commandeered by multiple threads. This is especially bad on a spindle drive, where head seeks are very, very slow. SSDs handle it a lot better, but it still takes an easy 50 microseconds, overhead that you just don't want or need. The ideal number of threads to access a disk that you can't otherwise expect to be cached well is always one.
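A minimal sketch of that single-threaded approach, using an explicit stack instead of recursion (so a very deep tree cannot overflow the call stack, and inaccessible folders are simply skipped):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class SequentialWalk
{
    public static IEnumerable<string> AllDirectories(string root)
    {
        var pending = new Stack<string>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            yield return dir;

            string[] children;
            try { children = Directory.GetDirectories(dir); }
            catch (UnauthorizedAccessException) { continue; } // protected OS folders
            catch (IOException) { continue; }

            foreach (string child in children)
                pending.Push(child);
        }
    }

    static void Main()
    {
        int count = 0;
        foreach (var dir in AllDirectories(@"C:\"))
            count++;
        Console.WriteLine(count + " directories found");
    }
}
```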

Hans Passant

The first thing to note is that you are trying to parallelise an IO-bound operation, which will distort the timings significantly.

The second thing to note is the nature of the parallelised tasks: You are recursively descending a directory tree. If you create multiple threads to do this, each thread is likely to be accessing a different part of the disk simultaneously - which will cause the disk read head to be jumping all over the place and slowing things down considerably.

Try changing your test to create an in-memory tree, and access that with multiple threads instead. Then you will be able to compare the timings properly without the results being distorted beyond all usefulness.

Additionally, you may be creating a great number of threads, and they will (by default) be threadpool threads. Having a great number of threads will actually slow things down when they exceed the number of processor cores.

Also note that when you exceed the thread pool minimum threads (defined by ThreadPool.GetMinThreads()), a delay is introduced by the thread pool manager between each new threadpool thread creation. (I think this is around 0.5s per new thread).

Also, if the number of threads exceeds the value returned by ThreadPool.GetMaxThreads(), the creating thread will block until one of the other threads has exited. I think this is likely to be happening.

You can test this hypothesis by calling ThreadPool.SetMaxThreads() and ThreadPool.SetMinThreads() to increase these values, and see if it makes any difference.

(Finally, note that if you are really trying to recursively descend from C:\, you will almost certainly get an IO exception when it reaches a protected OS folder.)

NOTE: Set the max/min threadpool threads like this:

ThreadPool.SetMinThreads(4000, 16);
ThreadPool.SetMaxThreads(4000, 16);

Follow Up

I have tried your test code with the threadpool thread counts set as described above, with the following results (not run on the whole of my C:\ drive, but on a smaller subset):

  • Mode 1 took 6.5 seconds.
  • Mode 2 took 15.7 seconds.
  • Mode 3 took 16.4 seconds.

This is in line with my expectations; adding a load of threading to do this actually makes it slower than single-threaded, and the two parallel approaches take roughly the same time.


In case anyone else wants to investigate this, here's some deterministic test code (the OP's code is not reproducible because we don't know his directory structure).

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

namespace Demo
{
    internal class Program
    {
        private static DirWithSubDirs RootDir;

        private static void Main()
        {
            Console.WriteLine("Loading file system into memory...");
            RootDir = new DirWithSubDirs("Root", 4, 4);
            Console.WriteLine("Done");

            //ThreadPool.SetMinThreads(4000, 16);
            //ThreadPool.SetMaxThreads(4000, 16);

            var w = Stopwatch.StartNew();
            ThisIsARecursiveFunctionInMemory(RootDir);

            Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);
            Console.ReadKey();
        }

        public static void ThisIsARecursiveFunctionInMemory(DirWithSubDirs currentDirectory)
        {
            var depth = currentDirectory.Path.Count(t => t == '\\');
            Console.WriteLine(depth + ": " + currentDirectory.Path);

            var children = currentDirectory.SubDirs;

            //Edit this mode to switch which way of parallelization it should use
            int mode = 3;

            switch (mode)
            {
                case 1:
                    foreach (var child in children)
                    {
                        ThisIsARecursiveFunctionInMemory(child);
                    }
                    break;

                case 2:
                    children.AsParallel().ForAll(t =>
                    {
                        ThisIsARecursiveFunctionInMemory(t);
                    });
                    break;

                case 3:
                    Parallel.ForEach(children, t =>
                    {
                        ThisIsARecursiveFunctionInMemory(t);
                    });
                    break;

                default:
                    break;
            }
        }
    }

    internal class DirWithSubDirs
    {
        public List<DirWithSubDirs> SubDirs = new List<DirWithSubDirs>();

        public String Path { get; private set; }

        public DirWithSubDirs(String path, int width, int depth)
        {
            this.Path = path;

            if (depth > 0)
                for (int i = 0; i < width; ++i)
                    SubDirs.Add(new DirWithSubDirs(path + "\\" + i, width, depth - 1));
        }
    }
}
Matthew Watson
  • Yes, filesystems are already optimized for sequential access - it will be faster. –  Sep 18 '14 at 08:52
  • Please see my update, I've changed the tree to be loaded into memory first but the results are still the same. – Devedse Sep 18 '14 at 08:56
  • Very descriptive analysis +1 – TheGeekZn Sep 18 '14 at 09:26
  • The main question I have is why Parallel.ForEach seems to be able to complete this code in 0.03 seconds while AsParallel().ForAll() takes days (plus having to create an extra thread per recursion level). And also, why does MSDN suggest the opposite? – Devedse Sep 18 '14 at 09:57
  • I find it really strange. I've added exactly that code and still Mode 2 takes an almost infinite amount of time. (Still only going 1 step at a time) – Devedse Sep 18 '14 at 12:41
  • Please see my Edit 4 :) – Devedse Sep 18 '14 at 12:53
  • @Devedse I note that you have opened a bounty for this, but the fact remains that I can't reproduce your results. The question is: Can anyone else? – Matthew Watson Sep 24 '14 at 07:44
  • Interestingly, I came across this as I was testing some access exception skipping implementations of a GetAllDirectories function that return either a `List` (26.3 secs), an `IEnumerable` with a `ToList` (26.3 secs) and an `IEnumerable` with a `ToList` that does the recursion with `Parallel.ForEach` (8.5 secs). The `Parallel.ForEach` is 4 times faster. – NetMage Oct 21 '17 at 00:03

The Parallel.For and .ForEach methods are implemented internally as roughly equivalent to running iterations in Tasks, i.e. a loop like:

Parallel.For(0, N, i => 
{ 
  DoWork(i); 
});

is equivalent to:

var tasks = new List<Task>(N);
for (int i = 0; i < N; i++)
{
    tasks.Add(Task.Factory.StartNew(state => DoWork((int)state), i));
}
Task.WaitAll(tasks.ToArray());

And from the perspective of every iteration potentially running in parallel with every other iteration, this is an okay mental model, but it does not happen in reality. Parallel, in fact, does not necessarily use one Task per iteration, as that is significantly more overhead than is necessary. Parallel.ForEach tries to use the minimum number of tasks necessary to complete the loop as fast as possible. It spins up tasks as threads become available to process those tasks, and each of those tasks participates in a management scheme (I think it's called chunking): a task asks for multiple iterations to be done, gets them, processes that work, and then goes back for more. The chunk sizes vary based on the number of tasks participating, the load on the machine, etc.
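A quick way to see this "few threads, chunked work" behaviour from the outside (not Parallel.ForEach's actual internals, just its visible effect) is to record the thread id of every iteration and count the distinct ids afterwards; on a typical machine the count stays close to the core count rather than growing into the hundreds:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ChunkingDemo
{
    static void Main()
    {
        var threadIds = new ConcurrentBag<int>();

        Parallel.ForEach(Enumerable.Range(0, 100000), i =>
        {
            threadIds.Add(Thread.CurrentThread.ManagedThreadId);
        });

        Console.WriteLine("Iterations run: " + threadIds.Count);
        Console.WriteLine("Distinct threads used: " + threadIds.Distinct().Count());
    }
}
```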

PLINQ’s .AsParallel() has a different implementation, but it ‘can’ still similarly fetch multiple iterations into a temporary store, do the calculations in a thread (but not as a task), and put the query results into a small buffer. (You get something based on ParallelQuery, and then further .Whatever() functions bind to an alternative set of extension methods that provide parallel implementations).

So now that we have a small idea of how these two mechanisms work, I will try to provide an answer to your original question:

So why is .AsParallel() slower than Parallel.ForEach? The reason stems from the following: Tasks (or their equivalent implementation here) do NOT block on I/O-like calls. They 'await' and free up the CPU to do something else. But (quoting the C# in a Nutshell book): "PLINQ cannot perform I/O-bound work without blocking threads". The calls are synchronous. They were written with the intention that you increase the degree of parallelism if (and ONLY if) you are doing such things as downloading web pages per task, work that does not hog CPU time.

And the reason why your function calls are exactly analogous to I/O-bound calls is this: one of your threads (call it T) blocks and does nothing until all of its child threads have finished, which can be a slow process here. T itself is not CPU-intensive while it waits for the children to unblock; it is doing nothing but waiting. Hence it is identical to a typical I/O-bound function call.
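As an aside: if you do have blocking work and still want PLINQ, the degree of parallelism can at least be pinned explicitly with the WithDegreeOfParallelism operator instead of being left to grow. The value 4 below is an arbitrary example, and the Thread.Sleep is only a stand-in for a blocking call:

```csharp
using System;
using System.Linq;
using System.Threading;

class PlinqDegreeDemo
{
    static void Main()
    {
        Enumerable.Range(0, 16)
            .AsParallel()
            .WithDegreeOfParallelism(4) // at most 4 workers at a time
            .ForAll(i =>
            {
                Thread.Sleep(100); // stand-in for blocking I/O
            });

        Console.WriteLine("done");
    }
}
```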

Marc.2377
Dean
  • Let me add that I am 100% in agreement with Hans Passant in his post if you really are writing a disk tree parser. While it may appear to be an I/O-bound process from the perspective of the CPU, the disk can only do one request at a time, and you will be flooding it with dozens of simultaneous read requests. If you were doing something like downloading web pages, then OK, each web page would be serviced by a different server. What you are effectively doing is akin to a denial-of-service attack on your disk. – Dean Sep 24 '14 at 21:15
  • Hey Dean, can you take a look at the comment I left under Hans' answer? In short, I wonder OP can still benefit from _some_ degree of parallelism, as long as it's kept small, without DoS'ing his storage. – Marc.2377 Oct 01 '19 at 21:49

Based on the accepted answer to How exactly does AsParallel work?

.AsParallel().ForAll() casts back to IEnumerable before calling .ForAll(),

so it creates 1 new thread plus N recursive calls (each of which generates a new thread).

  • What I find strange is that Parallel.ForEach doesn't do this, which makes it so much faster. (AsParallel().ForAll() takes 1 second per node, so with 100,000 nodes it would take more than a day to complete, compared to 0.03 seconds with Parallel.ForEach.) – Devedse Sep 18 '14 at 09:52
  • Well, I guess that's because Parallel.ForEach has full control of the iteration. Maybe it's just written better? – Sep 18 '14 at 14:03