
I'm optimizing a solver (for systems of linear equations) whose most critical part consists of

many (1000+) short (~10-1000 microseconds), massively parallel (128 threads on 64 CPU cores) sweeps over small (CPU-cache-sized) arrays. Pseudocode:

for (i = 0; i < num_iter; i++)
{
  // SYNC-POINT: all threads must finish sweep i before sweep i+1 starts
  parallel_for (j = 0; j < array_size; j++)
    array_out[j] = some_function(array_in[j]);
  swap(array_in, array_out);
}

Unfortunately, the standard parallelization constructs from OpenMP and TBB that I have tried so far (a serial outer loop plus a parallel inner loop, e.g. via tbb::parallel_for) do not handle this extremely fine-grained parallelism well: the thread library's setup cost for the inner loop seems to dominate the time spent inside the relatively short inner loop itself. (Note that very fine-grained inner loops are crucial for the total performance of the algorithm, because this way all data can be kept in the L2/L3 CPU cache.)
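For reference, here is a minimal sketch of the pattern I tried (serial outer loop, tbb::parallel_for re-launched for every sweep); some_function is just a placeholder for the real kernel:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <utility>
#include <vector>

// Placeholder for the real kernel (hypothetical).
inline float some_function(float x) { return 0.5f * x; }

void solve(std::vector<float>& array_in, std::vector<float>& array_out, int num_iter)
{
  for (int i = 0; i < num_iter; i++)
  {
    // Every call re-dispatches the work to TBB's worker threads; for a
    // sweep of only a few microseconds this dispatch cost dominates.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, array_in.size()),
      [&](const tbb::blocked_range<size_t>& r) {
        for (size_t j = r.begin(); j != r.end(); ++j)
          array_out[j] = some_function(array_in[j]);
      });
    std::swap(array_in, array_out);
  }
}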

EDIT to address answers, questions & comments so far:

  1. The example is just pseudocode to illustrate the idea. The actual implementation avoids false sharing by padding the array rows to CPU cache-line boundaries.

  2. some_function(array_in, j) is a simple stencil that accesses the current point j and a small neighborhood around it, e.g. some_function(array, j) = array[j-1] + array[j] + array[j+1];

The answer given by Jérôme Richard touches on a very interesting point about barriers (this is, IMO, the root of the problem). He suggests to "replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key." This is interesting but very general. How exactly would this be accomplished in this case? Does "point-to-point neighbor synchronization" involve an atomic primitive for every entry of the array?

zx-81
    Why don't you just swap the loops? Each thread can do `num_iter` function calls per array element. – Goswin von Brederlow Jul 24 '22 at 11:46
    Could there be false sharing and therefore cache thrashing due to the small `array_size`? – paleonix Jul 24 '22 at 12:05
  • What does some_function do? Is it vectorizable? – huseyin tugrul buyukisik Jul 24 '22 at 12:28
  • re. false sharing: the example is just pseudocode to illustrate the idea. The actual implementation avoids false sharing by padding the array rows to CPU cache-line boundaries. Furthermore, the dependency is not only array_out[j] = some_func(array_in[j]); it also accesses a small neighborhood around the current point in array_in[] (a 7-point stencil in 3D). – zx-81 Jul 25 '22 at 08:31
  • @zx-81 Please add such details to the question as an edit, so future readers don't have to read comments. – paleonix Jul 25 '22 at 09:39

2 Answers


The general solution to this problem is to create the threads and distribute the work only once, and then use fast synchronization points inside the threads. In this case, the outer loop is moved into the threaded function. This is possible with the TBB library by providing a range (tbb::blocked_range<size_t>) and a function to tbb::parallel_for (see an example here).
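To make this concrete, here is a minimal sketch of the "create threads once, synchronize cheaply" idea using plain C++20 threads and std::barrier (an illustration of the approach, not code from the question; some_function stands in for the real stencil):

#include <algorithm>
#include <barrier>   // C++20
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Placeholder for the real stencil (hypothetical).
inline float some_function(const std::vector<float>& a, size_t j) { return a[j]; }

void solve(std::vector<float>& in, std::vector<float>& out,
           int num_iter, unsigned num_threads)
{
  std::vector<float>* pin  = &in;
  std::vector<float>* pout = &out;

  // The completion function runs exactly once per sweep, after all threads
  // have arrived and before any of them is released: a safe place to swap.
  std::barrier sync(num_threads, [&]() noexcept { std::swap(pin, pout); });

  auto worker = [&](size_t begin, size_t end) {
    for (int i = 0; i < num_iter; ++i) {
      for (size_t j = begin; j < end; ++j)
        (*pout)[j] = some_function(*pin, j);
      sync.arrive_and_wait();   // cheap sync point, no thread re-launch
    }
  };

  // Threads are created once for the whole solve, not once per sweep.
  std::vector<std::jthread> pool;
  const size_t chunk = (in.size() + num_threads - 1) / num_threads;
  for (unsigned t = 0; t < num_threads; ++t) {
    const size_t b = std::min(in.size(), size_t(t) * chunk);
    const size_t e = std::min(in.size(), b + chunk);
    pool.emplace_back(worker, b, e);
  }
}   // std::jthread joins automatically here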

Barriers are known to scale poorly on many-core architectures, especially in such codes. The only way to make the program scale is to reduce the synchronization between threads and make it more local. For example, for stencils, the solution is to replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key to solving such problems. In fact, fundamental laws of physics prevent barriers from scaling: clocks cannot be perfectly synchronized in general relativity, and computers (unfortunately) obey the laws of physics.
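To make "point-to-point neighbor synchronization" concrete (and to answer the question's edit): it does not need an atomic per array entry, only one atomic progress counter per chunk/thread. Here is a minimal sketch for the 1D 3-point stencil, with hypothetical names (Progress, stencil_worker) of my own. Before starting sweep i, a thread waits only until its two neighbors have completed sweep i-1; distant threads never synchronize with each other directly:

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// One progress counter per thread/chunk, padded to its own cache line.
struct alignas(64) Progress {
  std::atomic<int> sweeps_done{0};
};

// Double-buffered 1D stencil with neighbor-only synchronization: thread t
// may start sweep i as soon as threads t-1 and t+1 have completed sweep
// i-1 (their halo cells are then valid, and they are done reading the
// buffer that t is about to overwrite). No global barrier is involved.
void stencil_worker(int t, int num_threads, int num_iter,
                    size_t begin, size_t end,             // this thread's chunk
                    std::vector<float>* in, std::vector<float>* out,
                    std::vector<Progress>& progress)      // size == num_threads
{
  auto wait_neighbor = [&](int n, int sweep) {
    if (n < 0 || n >= num_threads) return;                // no neighbor at the edges
    while (progress[n].sweeps_done.load(std::memory_order_acquire) < sweep)
      std::this_thread::yield();                          // or a pause/backoff loop
  };

  for (int i = 0; i < num_iter; ++i) {
    wait_neighbor(t - 1, i);
    wait_neighbor(t + 1, i);

    const std::vector<float>& a = *in;
    std::vector<float>&       b = *out;
    const size_t lo = std::max<size_t>(begin, 1);
    const size_t hi = std::min(end, a.size() - 1);
    for (size_t j = lo; j < hi; ++j)
      b[j] = a[j - 1] + a[j] + a[j + 1];                  // 3-point stencil

    progress[t].sweeps_done.store(i + 1, std::memory_order_release);
    std::swap(in, out);                                   // this thread's local view
  }
}

One std::jthread per chunk plus a shared std::vector<Progress> drives this; the synchronization cost per sweep is then constant per thread instead of growing with the total thread count.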

Many-core systems are now nearly always NUMA ones. Given your configuration, you are likely using an AMD Threadripper processor, which has multiple NUMA nodes. This means you should care about locality and the NUMA allocation policy. The default policy is generally first-touch: if your initialization is sequential, or if threads are mapped differently between initialization and computation, then cores have to fetch data from remote NUMA nodes, which is slow. In the worst case, all cores access the same NUMA node and saturate it, possibly resulting in slower execution than the sequential version. You should generally make the initialization parallel for better performance. Getting high performance on such architectures is far from easy. You need to carefully control the allocation policy (numactl can help with that), the initialization (parallel), the thread binding (with taskset, hwloc and/or environment variables) and the memory access pattern (by reading articles/books about how NUMA machines work and applying dedicated algorithms).
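For illustration, a minimal first-touch sketch (the helper name numa_aware_alloc is made up), assuming OpenMP with the same static schedule for initialization and compute:

#include <cstddef>

// Allocate without touching the pages, then let each OpenMP thread write
// the part it will later compute on: the page faults then happen on the
// owning core, so first-touch places each page on that core's NUMA node.
// Note that std::vector<float>(n) would zero-fill sequentially on the
// constructing thread and defeat this placement.
float* numa_aware_alloc(std::size_t n)
{
  float* a = new float[n];                        // no pages touched yet
  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t j = 0; j < (std::ptrdiff_t)n; ++j)
    a[j] = 0.0f;                                  // first touch on the owning core
  return a;
}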

By the way, there is probably a false-sharing effect happening in your code, because cache lines of array_out are likely shared between threads. This should not have a critical impact, but it does contribute to poor scalability (especially on your 64-core processor). The general solution to this problem is to align the array in memory on a cache line and to make sure the parallel splitting is done on cache-line boundaries. Alternatively, you can allocate the array parts in each thread. This is generally a better approach, as it ensures data is locally allocated and locally filled, and it makes data sharing/communication between NUMA nodes and even cores more explicit (i.e. better control), though it can make the code more complex (there is no free lunch).
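A small sketch of the cache-line-aligned splitting (chunk_for is a made-up helper; assumes 64-byte lines and float elements):

#include <algorithm>
#include <cstddef>
#include <utility>

constexpr std::size_t kFloatsPerLine = 64 / sizeof(float);  // 16 floats per line

// Returns [begin, end) for thread t of T such that every boundary falls on
// a cache-line multiple, so no two threads ever write to the same line.
std::pair<std::size_t, std::size_t> chunk_for(std::size_t n, unsigned t, unsigned T)
{
  const std::size_t lines      = (n + kFloatsPerLine - 1) / kFloatsPerLine;
  const std::size_t per_thread = (lines + T - 1) / T;
  const std::size_t b = std::min(n, std::size_t(t) * per_thread * kFloatsPerLine);
  const std::size_t e = std::min(n, std::size_t(t + 1) * per_thread * kFloatsPerLine);
  return {b, e};
}

Combined with a 64-byte-aligned allocation (e.g. std::aligned_alloc, or alignas on the buffer), adjacent threads then never write to the same cache line of array_out.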

Jérôme Richard
  • Thanks for the answer. However, the linked page just contains information about the standard TBB parallel_for(), which I already tried. Your comments about barriers are interesting, because this is, IMO, the root of the problem. You wrote: "replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key". This is interesting but very general. Could you elaborate on this in more detail? – zx-81 Jul 25 '22 at 08:26
  • This was deliberately general, because `some_function` is general and without more information I cannot be more precise. This function determines the dependencies, which are critical for choosing the best synchronization pattern. If `some_function` required the full array, then a barrier would be needed, but this is often not the case (and in fact, if it did, the code would not scale because of this anyway). – Jérôme Richard Jul 25 '22 at 13:23

Sharing data across threads is slow. This is because each CPU core has at least one layer of private cache. The moment you share data between cores/threads, those private caches need to be synchronised, which is slow.

Threads running in parallel across different cores work best if they do not share data.

In your case, if the data is read-only, you might be best off giving each thread its own copy of the data. For read-write data, you have to accept the synchronisation slowdown.

doron