1

I'm dealing with rather large data sets and am testing different approaches to get the fastest output. At this current stage in testing I have nested parallel loops like so:

Parallel.ForEach(listOne, itemOne =>
{
    Parallel.ForEach(listTwo, itemTwo =>
    {           
        Parallel.ForEach(listThree, itemThree =>
        {
            //do stuff
        });
    });
});

This is obviously just an example of how my code is being used. What would be the best way (if any) to set a MaxDegreeOfParallelism value so that it could be utilized by each of the loops?

I understand I can set a value for the MaxDegreeOfParallelism in each of the loops options, but I'm not sure how that would be as a final result. And finally, would setting a MaxDegreeOfParallelism value even be necessary/useful in this scenario?

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
SamNewton
  • 72
  • 5
  • 3
    Often you'll find that just parallelizing the outer loop is sufficient to achieve good performance. For example, when parallel processing an image, it makes sense to process all rows of pixels in parallel, but for each row, process the columns sequentially. In general, partitioning the work in to very small chunks (and therefore potentially many threads) doesn't pay off properly unless you use Tasks. It depends how many items `listOne` has in relation to the number of processors you have and if the lists within (`listTwo` & `listThree`) are generally balanced for each `itemOne`. – Wyck Mar 20 '23 at 18:47
  • It's a very large dataset. At first I only did the outer loop but only saw about a performance improvement by about 25%. So this is just phase two of the testing with this data set to compare results. – SamNewton Mar 20 '23 at 18:49
  • 4
    It's important to identify the bottleneck. Is your program bound by memory speed? (adding threads won't help). Bound by IO? (threads won't help). Bound by network bandwidth? (threads won't help). Bound by arithmetic/processing? (Threads to the rescue!) – Wyck Mar 20 '23 at 18:51
  • Related question: [Nested Parallel.ForEach loops](https://stackoverflow.com/questions/12618338/nested-parallel-foreach-loops). It doesn't contain quality answers though. – Theodor Zoulias Mar 20 '23 at 22:47
  • 2
    Regarding the `MaxDegreeOfParallelism`, here are some relevant questions: [What does MaxDegreeOfParallelism do?](https://stackoverflow.com/questions/9538452/what-does-maxdegreeofparallelism-do), [Does Parallel.ForEach limit the number of active threads?](https://stackoverflow.com/questions/1114317/does-parallel-foreach-limit-the-number-of-active-threads), [Multiple Parallel.ForEach loops in .Net](https://stackoverflow.com/questions/66261010/multiple-parallel-foreach-loops-in-net). – Theodor Zoulias Mar 20 '23 at 22:57
  • @TheodorZoulias as the master of multithreading in C# SO, only yes or no for 500 points and beat the game: is nested parallels futile? – Leandro Bardelli Mar 21 '23 at 00:57
  • 1
    you could do a cartesian join of the 3 lists into one list, then apply the parallel on the single list it generates? - Example https://stackoverflow.com/questions/5004943/how-do-i-take-the-cartesian-join-of-two-lists-in-c – Ctznkane525 Mar 21 '23 at 00:58
  • Leandro I would advice against nesting `Parallel` loops. I've just posted [an answer](https://stackoverflow.com/questions/12618338/nested-parallel-foreach-loops/75796832#75796832) to the older question. – Theodor Zoulias Mar 21 '23 at 02:29
  • Before doing anything else I would highly recommend that you do some *profiling* of your application. For a random piece of unoptimized software it is not uncommon that it spends a large amount of time doing unnecessary or repeated work, or use a poor algorithm or datastructure for the particular purpose. Without profiling it is really difficult to tell where the time actually goes. – JonasH Mar 21 '23 at 07:51
  • What does `do stuff` do and what is `listOne`? It matters a *lot*. It's not enough to just use a lot of cores. Data locality matters too. Adjacent entries in an array or `List<>` are adjacent in memory too and can end up together in the CPU's cache. Bad parallelization can cause *worse* performance. In fact, it may be faster to create a local copy of the data, eg a column's data, into a buffer so that access is local. – Panagiotis Kanavos Mar 24 '23 at 09:40
  • @SamNewton in [this question](https://stackoverflow.com/questions/72678263/parallel-vs-serial-results-arent-understandable/73042954#73042954) someone doubted that parallelization worked because their triply nested PFOR loops offered little speedup so I created a benchmark to test multiple ways to parallelize matrix multiplication. Using local copies resulted in >200% speedup. The final loop is 9 times faster than the serial one on a 6-core hyperthreaded CPU, so close to the max. It can probably be improved further with SIMD vectorization for another 2x-4x, depending on the number types and CPU – Panagiotis Kanavos Mar 24 '23 at 09:46

0 Answers0