3

So here's a minimal version of code that works, but is inefficient:

Parallel.ForEach(list, x =>
{
    doThing1(x);
});

Thing1Done = true;

Parallel.ForEach(list, x =>
{
    doThing2(x);
});

Thing2Done = true;

Parallel.ForEach(list, x =>
{
    doThing3(x);
});

Thing3Done = true;

Intuitively, I'd like to run all 3 "things" within the same loop, but they must be able to synchronize temporarily to update the respective Thing*n*Done property.

Pseudocode for this idea is as follows:

Parallel.ForEach(list, x =>
{
    doThing1(x);
    // wait for doThing1 to be completed for all other elements in list
    Thing1Done = true;

    doThing2(x);
    // wait for doThing2 to be completed for all other elements in list
    Thing2Done = true;

    doThing3(x);
    // wait for doThing3 to be completed for all other elements in list
    Thing3Done = true;
});

So for example, it's necessary that doThing1() finishes its execution for every member of list before Thing1Done is set to true. doThing2() can only begin after Thing1Done has been set.

Each individual step is not overly expensive, and I'm concerned about the overhead involved with the naive approach. What is the best way to efficiently solve this task, assuming that the overhead involved in initializing the threads is my largest concern? I'd also like to avoid busy waiting (where the thread spins in a while loop doing nothing useful until some flag is set to true) if at all possible.

I am willing to write something more general if the Parallel library cannot do what I want.

  • That's not going to work -- `Parallel.ForEach` doesn't spawn a new thread for every item in the list, rather it re-uses a small number of threads. Blocking all of those threads will bring the whole system to a screeching halt before it's managed to start processing all items. Your question assumes it uses a thread per item. – canton7 Apr 20 '21 at 09:39
  • What are you using `Thing1Done`, etc, for? Is it critical that all items complete Thing1 before any of them start Thing2? – canton7 Apr 20 '21 at 09:40
  • I'm not sure what you need the flags for. Both snippets ensure that for any item x, the three functions are called in order. What am I missing? – Fildor Apr 20 '21 at 09:42
  • If you want more of a "pipeline"-like behavior, have a look into [DataFlow](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library). – Fildor Apr 20 '21 at 09:43
  • @canton7 That assumption does not matter. The intent is clear, whether or not I consider each function to be running on mutually independent threads or not. And yes, it's critical that Thing1Done is updated before any of them start on Thing2. This is an extremely simplified version of the actual problem, of course. – sausageberg Apr 20 '21 at 09:51
  • @sausageberg Are you sure? You can only "wait for other threads" if each item is on its own thread, surely? Think of the edge case where `Parallel.ForEach` only decides to use 1 thread (which can happen, IIRC?). In that case, there are no "other threads" to wait for – canton7 Apr 20 '21 at 09:51
  • @canton7 I think you're choosing to be needlessly pedantic about this. Like I said, the intent is clear, whether or not the pseudocode accurately represents the solution. – sausageberg Apr 20 '21 at 09:53
  • @Fildor This is a minimal version of what I'm trying to do. I'm doing something more complicated than just setting flags, but that wouldn't make for a very readable post. – sausageberg Apr 20 '21 at 09:54
  • @sausageberg I'm trying to understand what you actually mean by "wait for other threads" -- as it's stated, what you want may be clear in your head, but you're trying to explain it by using an example which doesn't make much sense. Help us to help you, please. For example, in a parallel processing scenario, there's a good chance that the last item is only started once all the other items are finished. Is that acceptable to you? – canton7 Apr 20 '21 at 09:54
  • @canton7 Again, like I said, the intent is clear. You are reading way too much into the comments in my pseudocode. But if you wish for clarification, read the paragraph above it. I just want to ensure that doThing1() has finished executing for all members of 'list' before Thing1Done is set to true. – sausageberg Apr 20 '21 at 09:58
  • It's not clear to me, which is why I'm asking. Repeatedly stating "it's clear" doesn't help us! You're asking for free help here -- it's expected that you put the effort into clearly explaining yourself – canton7 Apr 20 '21 at 09:59
  • @canton7 Is it clear now? – sausageberg Apr 20 '21 at 10:00
  • Please replace the comments `// wait for other threads` with something more descriptive of your intentions about the actual problem you are trying to solve. Like `// wait for the completion of the 'doThing1' phase for all items in the list`. Describing it based on implementation details of a mechanism (`Parallel.ForEach`) that you don't know exactly how it works can only cause confusion. – Theodor Zoulias Apr 20 '21 at 10:12
  • @Theodor Zoulias I know how Parallel.ForEach works. I've painfully coded multithreaded applications in C before, so I'm familiar with threading without all this abstraction. But to at least prevent any more pedantic comments, I've changed it as requested. – sausageberg Apr 20 '21 at 10:18
  • Aha. So I guess that you know that the `Parallel.ForEach` method [saturates the `ThreadPool`](https://stackoverflow.com/questions/66261010/multiple-parallel-foreach-loops-in-net/66263583#66263583) when used without explicitly specifying the `MaxDegreeOfParallelism` option, right? – Theodor Zoulias Apr 20 '21 at 10:28
  • @sausageberg _"I'm doing something more complicated than just setting flags"_ In that case, I am even more confident DataFlow may be for you. I imagine one block for each "stage" processing the (or "a") list in parallel, then handing it on to the next. If you need "things done or signalled" in between, you can make those action blocks, too. This approach even has the advantage that you wouldn't need to wait for the whole process to finish. You can just submit lists (being your "workitem") to the pipeline and collect the results. – Fildor Apr 20 '21 at 11:23
  • @TheodorZoulias Yeah I'm aware of that. And as your other solution pointed out, it's far too costly to spin up a new thread for every single item, which is unfortunate since it would otherwise make this quite easy. – sausageberg Apr 20 '21 at 11:34
  • Out of curiosity, what kind of work are the `doThing1`, `doThing2` and `doThing3` doing? Is it [CPU-bound or I/O-bound](https://stackoverflow.com/questions/868568/what-do-the-terms-cpu-bound-and-i-o-bound-mean), or a mix of both? – Theodor Zoulias Apr 20 '21 at 12:09
  • Maybe [`Task.WhenAll`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.whenall?view=net-5.0) can help? Start a task for each `doThing` function and use `WhenAll` to continue? – asaf92 Apr 20 '21 at 12:20
  • @sausageberg what do those methods do? Why do you assume the first attempt isn't efficient? `Parallel.ForEach` is a *specialized* method used only for data parallelism. It partitions the data into roughly as many partitions as there are cores and uses one worker task to crunch the data in each partition. If you have a lot of data for in-memory crunching, your very first example is as efficient as possible – Panagiotis Kanavos Apr 20 '21 at 12:58
  • @sausageberg on the other hand, if you can break your process into steps that can run in a pipeline, DataFlow is the correct choice. DataFlow uses the same CSP model used eg by Go's channels. It's **very important** to specify what you *really* want to do, in order to select the correct processing model. `Parallel.ForXYZ` is only built for *unsynchronized* crunching of lots of data. – Panagiotis Kanavos Apr 20 '21 at 13:03
  • @TheodorZoulias It's all CPU bound code. – sausageberg Apr 20 '21 at 13:56
  • OK, so you are using the correct tool for the job. I wonder why you dislike the multiple-parallel-loops approach. Is it based purely on considerations about efficiency, or are there also architectural reasons that make the single-parallel-loop approach more appealing? – Theodor Zoulias Apr 20 '21 at 14:15

2 Answers

6

You could use the Barrier threading synchronization primitive:

var barrier = new Barrier(list.Count);

var options = new ParallelOptions()
{
    MaxDegreeOfParallelism = list.Count,
    TaskScheduler = new ThreadPerTask()
};

Parallel.ForEach(list, options, x =>
{
    doThing1(x);
    // wait for doThing1 to be completed for all other elements in list
    barrier.SignalAndWait();

    doThing2(x);
    // wait for doThing2 to be completed for all other elements in list
    barrier.SignalAndWait();

    doThing3(x);
    // wait for doThing3 to be completed for all other elements in list
    barrier.SignalAndWait();
});

The SignalAndWait method signals the completion of an item, and waits until all items have completed (until the ParticipantsRemaining property becomes zero). After that the CurrentPhaseNumber property is incremented, all threads are unblocked simultaneously, and are free to race towards the next milestone.

This is an extremely inefficient way to process the items of the list, because it requires a dedicated thread per item. You will need a custom TaskScheduler in order to implement this setup, like the one shown below:

public class ThreadPerTask : TaskScheduler
{
    protected override void QueueTask(Task task)
    {
        // Execute each queued task on its own dedicated thread,
        // bypassing the ThreadPool entirely.
        new Thread(() => this.TryExecuteTask(task))
        {
            IsBackground = true
        }.Start();
    }

    // Never execute tasks inline, so that every task gets its own thread.
    protected override bool TryExecuteTaskInline(Task task,
        bool taskWasPreviouslyQueued) => false;

    // This scheduler keeps no internal queue, so there is nothing to report.
    protected override IEnumerable<Task> GetScheduledTasks() { yield break; }
}
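For reference, here is a minimal standalone demo of the Barrier phase mechanics, independent of the code above (the three participants, the post-phase action, and the console messages exist only for this sketch):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class BarrierDemo
{
    static void Main()
    {
        // Three participants; the optional post-phase action runs once
        // after all participants have signalled, before any is released.
        using var barrier = new Barrier(3, b =>
            Console.WriteLine($"Phase {b.CurrentPhaseNumber} completed"));

        var workers = new Task[3];
        for (int i = 0; i < 3; i++)
        {
            int id = i;
            workers[id] = Task.Run(() =>
            {
                Console.WriteLine($"Worker {id}: step A");
                barrier.SignalAndWait(); // blocks until all 3 have signalled
                Console.WriteLine($"Worker {id}: step B");
                barrier.SignalAndWait();
            });
        }
        Task.WaitAll(workers);
    }
}
```

Every "step B" line is guaranteed to print after all three "step A" lines, which is exactly the phase ordering the loop above relies on.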

IMHO your first approach, the one that uses multiple consecutive Parallel.ForEach loops, is the correct way to solve this problem.

Theodor Zoulias
1

As mentioned in the comments, your second example would have enormous overhead, since it requires one thread per item. That will probably be more threads than are available in the threadpool, so new threads will need to be created, and the rate at which new threads are created is limited. So I would expect terrible performance, at least assuming you have many items to process.

If the requirement is that doThing1 needs to have completed for all items before doThing2 starts, then I do not think you can do much better than multiple sequential parallel loops. I would not expect the overhead of this approach to be all that bad, since it will use the threadpool rather than spawning new threads.

A custom partitioner might help, or you could pre-split your list into chunks and process each chunk in parallel. By default, parallel loops try to split the work into partitions, and try to adapt the size of these partitions to balance overhead against processor utilization. But if you have prior knowledge of the kind of work to be done, you can probably do a better job yourself. As always when talking about performance, you will probably need to measure to see which alternative is faster.
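To sketch the pre-chunking idea for the first phase (this reuses `list`, `doThing1` and `Thing1Done` from the question, and is an untested sketch rather than a benchmarked solution):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Partitioner.Create(0, count) hands each worker a contiguous index
// range, so the parallel machinery is invoked once per chunk rather
// than once per item, reducing per-item overhead.
Parallel.ForEach(Partitioner.Create(0, list.Count), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        doThing1(list[i]);
});
Thing1Done = true; // safe: Parallel.ForEach has joined before this line runs

// ...repeat the same pattern for doThing2 / doThing3.
```

Whether this beats the default partitioning depends on how cheap each `doThing` call is, so it is worth measuring both variants.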

If you split the work into a number of chunks similar to the number of processors, you could perhaps use a model similar to your second example. But I'm unsure whether it has any actual advantages.
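A rough sketch of that chunked model, combined with the Barrier from the other answer (again reusing `list`, the `doThing*` methods and the `Thing*Done` flags from the question; the round-robin split and the use of `LongRunning` tasks are assumptions of this sketch, not a tested implementation):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int workerCount = Environment.ProcessorCount;

// Naive round-robin split of the list into one chunk per worker.
var chunks = Enumerable.Range(0, workerCount)
    .Select(w => list.Where((_, i) => i % workerCount == w).ToList())
    .ToList();

using var barrier = new Barrier(workerCount);

// LongRunning requests a dedicated thread per chunk, so blocking on the
// barrier cannot deadlock a shared pool the way Parallel.ForEach could.
Task[] workers = chunks.Select(chunk => Task.Factory.StartNew(() =>
{
    foreach (var x in chunk) doThing1(x);
    barrier.SignalAndWait(); // every chunk has finished doThing1
    Thing1Done = true;       // each worker writes the same value, so this is safe

    foreach (var x in chunk) doThing2(x);
    barrier.SignalAndWait();
    Thing2Done = true;

    foreach (var x in chunk) doThing3(x);
    barrier.SignalAndWait();
    Thing3Done = true;
}, TaskCreationOptions.LongRunning)).ToArray();

Task.WaitAll(workers);
```

This keeps the thread count at roughly one per core instead of one per item, but it still pays for `workerCount` dedicated threads, so the sequential-loops approach may well remain simpler and just as fast.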

JonasH