
For multi-template matching with OpenCvSharp I want to use Parallel.ForEach() and limit the maximum number of threads used to n-1, so that at least one thread stays free for other tasks that may occur (n is the total number of CPU-level threads available).

E.g.: My system has a 4-core CPU with 2 threads per core, so n = 8 and the maximum here should be 7.

How can I do this without hardcoding it? (It should also work on other PCs with a different number of threads.)

   // Use a maximum of n-1 threads
   int maxNrOfThreads = 7;

   Parallel.ForEach(this.Symbols.Keys, new ParallelOptions { MaxDegreeOfParallelism = maxNrOfThreads }, key => {
      //EXECUTION (template match every symbol to a given image)
   });

Here is the execution part for whoever may be interested:

   Mat grayTemplate = this.Symbols[key].GrayscaledSymbol;

   // The result matrix has (W - w + 1) x (H - h + 1) dimensions
   Mat res = new Mat(grayImage.Rows - grayTemplate.Rows + 1, grayImage.Cols - grayTemplate.Cols + 1, MatType.CV_32FC1);

   Cv2.MatchTemplate(grayImage, grayTemplate, res, TemplateMatchModes.CCoeffNormed);
   Cv2.Threshold(res, res, MIN_ACCURACY, 1.0, ThresholdTypes.Tozero);

   while (true)
   {
      double minval, maxval, threshold = MIN_ACCURACY;
      OpenCvSharp.Point minloc, maxloc;
      Cv2.MinMaxLoc(res, out minval, out maxval, out minloc, out maxloc);

      if (maxval < threshold)
         break;

      // Add bounding box to result object
      ret.AddResult(key, maxloc.X, maxloc.Y, grayTemplate.Width, grayTemplate.Height);

      // Fill in the res Mat so the same area isn't found again by MinMaxLoc
      Rect outRect;
      Cv2.FloodFill(res, maxloc, new Scalar(0), out outRect, new Scalar(0.1), new Scalar(1.0));
   }
  • Why should it be 7? `Parallel.ForEach` will detect the current number of cores, start with that number of tasks *and adjust itself based on the actual load*. No matter how many worker tasks (not threads) are used, the OS won't let other processes starve. You don't have to do anything unless you want to explicitly throttle or increase the number of threads. – Panagiotis Kanavos Jul 25 '22 at 15:11
  • `ProcessorCount` will include "virtual" cores if Hyperthreading is used, so the Core-7 heuristic is meaningless. `Parallel.ForEach` uses cooperative multitasking so no task can monopolize a core. Unless the `Action` itself takes too long. So what are you trying to do? And what does `Execution` do? – Panagiotis Kanavos Jul 26 '22 at 06:52
  • The DOP depends more on the number of partitions than the MaxDOP. If you pass an array or `IList<>` derived container, the `Count` will be used to partition the data into static ranges. After that, each task will go to work on a single partition. If you have 7 partitions, you won't get more than 7 concurrent tasks. In any case you never get *threads*, you get Tasks that run from `100ms up to 100 +50*(ProcCount)` before yielding. Unless your `Action` is taking longer than that – Panagiotis Kanavos Jul 26 '22 at 07:00
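The partitioning behavior described in the comments above can be checked empirically. Below is a minimal sketch (the names `DopProbe`, `current` and `peak` are illustrative, not from this thread) that records the peak number of concurrently running workers; with an array source and an explicit `MaxDegreeOfParallelism`, the reported peak should not exceed that limit:

   using System;
   using System.Linq;
   using System.Threading;
   using System.Threading.Tasks;

   static class DopProbe
   {
      static int current, peak;

      static void Main()
      {
         // IList/array source => static range partitioning based on Count
         int[] items = Enumerable.Range(0, 1000).ToArray();
         int maxDop = Math.Max(1, Environment.ProcessorCount - 1);

         Parallel.ForEach(items, new ParallelOptions { MaxDegreeOfParallelism = maxDop }, i =>
         {
            int now = Interlocked.Increment(ref current);

            // Lock-free update of the observed peak concurrency
            int seen;
            while (now > (seen = Volatile.Read(ref peak)))
               Interlocked.CompareExchange(ref peak, now, seen);

            Thread.Sleep(5); // stand-in for real work
            Interlocked.Decrement(ref current);
         });

         Console.WriteLine($"Peak concurrent workers: {peak} (MaxDOP was {maxDop})");
      }
   }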

1 Answer


You're looking for `Environment.ProcessorCount`.
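For example, a minimal sketch applying it to the code from the question (the `Math.Max` guard is an added safety net for single-core machines, not part of the original snippet):

   // ProcessorCount returns the number of logical processors,
   // e.g. 8 on a 4-core CPU with 2 threads per core, so n-1 = 7 here.
   int maxNrOfThreads = Math.Max(1, Environment.ProcessorCount - 1);

   Parallel.ForEach(this.Symbols.Keys,
      new ParallelOptions { MaxDegreeOfParallelism = maxNrOfThreads },
      key =>
      {
         // template match every symbol against the given image
      });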

Blindy
  • There's usually no reason to check the processor count. `Parallel.ForEach` itself uses that value *but* will adjust the number of worker tasks based on load. The OS won't let other processes starve either. – Panagiotis Kanavos Jul 25 '22 at 15:14
  • @PanagiotisKanavos the `Parallel.ForEach` doesn't adjust the number of worker tasks based on load. It just grabs aggressively all the `ThreadPool` threads that are currently available. For example if the `ThreadPool` happens to have 100 available threads at the moment, the `Parallel.ForEach` will use all these 100 threads, and will ask for more. IMHO this is terrible, and makes specifying explicitly the `ParallelOptions.MaxDegreeOfParallelism` a very good idea. – Theodor Zoulias Jul 25 '22 at 15:23
  • 1
    `Parallel.ForEach` in general is terrible because it doesn't understand `Task` objects. `Task.WhenAll` is an almost drop-in replacement that understands modern async code, and coupled with correct functions on your end will prevent starvation in a much more efficient way. That said, you should be telling OP that, this isn't news to me. – Blindy Jul 25 '22 at 15:25
  • @Blindy quite the opposite. It *uses* tasks instead of threads. You refer to asynchronous operations instead, for which it was explicitly *not* built. That's not the method's fault. It was explicitly built for in-memory data parallelization and works to optimize *that* scenario - it partitions the source data, then uses a worker task per partition, to ensure minimal synchronization between workers. This is clear and explicit since 2010. It definitely doesn't use the threadpool to decide how many partitions to use. `Task.WhenAll` isn't even close – Panagiotis Kanavos Jul 25 '22 at 15:27
  • The [documentation](https://learn.microsoft.com/en-us/previous-versions/msp-n-p/ff963552(v=pandp.10)#controlling-the-degree-of-parallelism) says something different from any of you. "The degree of parallelism is automatically managed by the underlying components of the system; the implementation of the Parallel class, the default task scheduler, and the .NET thread pool all play a role in optimizing throughput under a wide range of conditions." – Ben Voigt Jul 25 '22 at 15:29
  • Blindy the `Task.WhenAll` lacks important features of the `Parallel.ForEach`. It offers no way to limit the degree of concurrency, and it doesn't support an early exit strategy. Starting from .NET 6 an async-enabled version of `Parallel.ForEach` exists. It's the [`Parallel.ForEachAsync`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreachasync). – Theodor Zoulias Jul 25 '22 at 15:31
  • @Theodor Zoulias it has an early exit mechanism, it's called `CancellationToken`. And my point is that the need of changing the degree of concurrency by hand **is a direct consequence of questionable code design**, which `Task.WhenAll` and proper async code helps avoid. – Blindy Jul 25 '22 at 15:34
  • @BenVoigt that was the point I was trying to make before things got derailed. The doc link you provided should be accompanied by the [partitioners article](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/custom-partitioners-for-plinq-and-tpl). Unfortunately, the way the docs are written right now, only people that already know these things exist can find them – Panagiotis Kanavos Jul 25 '22 at 15:38
  • @BenVoigt the documentation of the `Parallel.ForEach` is misleading. It gives the impression that a highly intelligent and sophisticated AI component will manage the utilization of your threads. In reality it's almost as dumb as it gets. All that an unconfigured `Parallel.ForEach` really knows is how to saturate your `ThreadPool`. Microsoft messed up with the default `Parallel.ForEach` configuration, and they didn't repeat the same error years later when they invented PLINQ and `Parallel.ForEachAsync`. But they haven't bothered to fix the `Parallel.ForEach` documentation either. – Theodor Zoulias Jul 25 '22 at 15:40
  • Blindy sure, with a `CancellationTokenSource` and a `SemaphoreSlim` you can make the `Task.WhenAll` imitate the `Parallel.ForEach` pretty well. But at that moment you have just reinvented an inferior version of `Parallel.ForEachAsync`. – Theodor Zoulias Jul 25 '22 at 15:44
  • @TheodorZoulias: I don't see any implied AI. The documentation does claim that some heuristics are in place. If that isn't true, have documentation bugs been filed? – Ben Voigt Jul 25 '22 at 16:53
  • @PanagiotisKanavos: That page says very little about the defaults. I could find only "By default when it is passed an IList or an array, PLINQ always uses range partitioning without load balancing." which says nothing about whether the number of processors influences the default number of partitions. – Ben Voigt Jul 25 '22 at 16:56
  • @BenVoigt that's a fair point. No, I haven't filed a [bug report](https://github.com/dotnet/dotnet-api-docs/issues) for the current state of the `Parallel.ForEach` documentation. Maybe I should. – Theodor Zoulias Jul 25 '22 at 17:25
  • @BenVoigt there are no simple defaults and they involve randomized timeouts. All `ForEach` and `For` methods end up calling a [TaskReplicator.Run](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs#L2866) method. That method's delegate runs inside a task and processes items from a partition until a timeout is reached. When that expires, the method exits and a new task starts with the remainder of the partition. – Panagiotis Kanavos Jul 25 '22 at 17:54
  • @BenVoigt these timeouts [are small](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/TaskReplicator.cs#L164), from 100ms up to 50ms*(ProcessorCount-1). This could get interesting in a manycore machine. That alone means no task can hog a single core (unless it's #255 on one of the Lenovo behemoths). It also means that if one partition finishes, no new tasks will be created. – Panagiotis Kanavos Jul 25 '22 at 17:57
  • @BenVoigt so instead of spawning 100 task that run indefinitely causing thrashing, it seems the `Parallel` implementation starts relatively short tasks to process data over and over. If the user's Action takes too long, more small tasks (replicas) will be created. The maximum number of concurrent tasks/Replicas seems to be controlled by the number of partitions. Which, in case of `IEnumerable<>` sources, is dynamic – Panagiotis Kanavos Jul 25 '22 at 18:00
  • @PanagiotisKanavos The main problem I found with `Parallel.ForEach`, which I hit just recently, is as you say the constant regurgitation of the tasks (invoking the `TLocal` function again and again, which was a heavy function). In my case it was 100% CPU-bound work, and it went *much much* faster just partitioning the data into `ProcessorCount` partitions and spawning exactly that many tasks to work them. – Charlieface Jul 25 '22 at 23:46
  • @Charlieface `a heavy function` that was the problem. Data parallelism means doing stuff on lots of data not doing heavy stuff on little data. In *my* case, in a 2K*5K*1K matrix multiplication I got 9x speedup on a 6-core machine. 6 for the cores and 3 for hyperthreading. `ProcessorCount` is 12 which is clearly *not* true. On top of that, using IList/array instead of IEnumerable results in static partitioning based on the item count. With IEnumerable this isn't possible and dynamic partitioning is used. – Panagiotis Kanavos Jul 26 '22 at 06:44
  • @BenVoigt for static partitioning, the ProcessorCount *is* used to [define the number of partitions and thus worker tasks](https://github.com/dotnet/runtime/blob/main/src/libraries/System.Threading.Tasks.Parallel/src/System/Threading/Tasks/Parallel.cs#L961) – Panagiotis Kanavos Jul 26 '22 at 07:03
  • @all: Off-topic: what is currently the fastest method for parallelizing without making the system freeze or allocating CPU power (threads, workers, etc.) that the system needs elsewhere? (This is actually the main reason I was asking the question) – Eduard Szilaghi Jul 26 '22 at 11:50
  • @EduardSzilaghi that's something that you might want to ask in a separate question. We were carried away here. Comments are intended for asking for clarifications and suggesting improvements for the question or the answer, not for answering questions or extended discussions. – Theodor Zoulias Jul 26 '22 at 15:48
  • @EduardSzilaghi processor affinity would help you there. – Blindy Jul 26 '22 at 16:19
  • @BenVoigt I've opened a new issue on GitHub about this subject [here](https://github.com/dotnet/runtime/issues/72981 "ParallelOptions.MaxDegreeOfParallelism documentation, is the given advice correct?"). Feel free to share your opinion if you wish! – Theodor Zoulias Jul 28 '22 at 13:45
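For reference, the `Parallel.ForEachAsync` mentioned in the comments accepts the same `ParallelOptions`. A minimal .NET 6+ sketch (`symbols` and `MatchTemplateAsync` are hypothetical stand-ins, not code from this thread):

   await Parallel.ForEachAsync(symbols.Keys,
      new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount - 1 },
      async (key, cancellationToken) =>
      {
         // hypothetical async worker; the token enables cooperative early exit
         await MatchTemplateAsync(key, cancellationToken);
      });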