I'm optimizing a solver for systems of linear equations whose most critical part consists of many (1000+) short (~10-1000 microseconds) massively parallel (128 threads on 64 CPU cores) sweeps over small (CPU-cache-sized) arrays, pseudocode:
    for (i = 0; i < num_iter; i++)
    {
        // SYNC-POINT
        parallel_for (j = 0; j < array_size; j++)
            array_out[j] = some_func(array_in, j);
        swap(array_in, array_out);
    }
Unfortunately, the standard parallelization constructs provided by OpenMP or TBB that I have tried so far (serial outer loop plus parallel inner loop, e.g. via tbb::parallel_for) don't seem to handle this extremely fine-grained parallelism very well, because the thread libraries' setup cost for the inner loop seems to dominate the time spent inside the relatively short inner loop. (Note that very fine-grained inner loops are crucial for the total performance of the algorithm, because only this way can all data be kept in the L2/L3 CPU cache.)
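For reference, the serial-outer/parallel-inner variant I tried with TBB looks roughly like this (simplified sketch; the function name solve and the use of std::vector are only for illustration):

    // Roughly what I have tried so far with TBB: the outer loop stays serial
    // and every sweep launches a fresh tbb::parallel_for, so each of the
    // 1000+ iterations pays the full fork/join + implicit barrier cost.
    #include <tbb/parallel_for.h>
    #include <cstddef>
    #include <utility>
    #include <vector>

    void solve(std::vector<double>& in, std::vector<double>& out, int num_iter)
    {
        for (int i = 0; i < num_iter; ++i)
        {
            // SYNC-POINT: implicit barrier at the end of every parallel_for
            tbb::parallel_for(std::size_t(1), in.size() - 1, [&](std::size_t j) {
                out[j] = in[j - 1] + in[j] + in[j + 1];   // some_func as a 3-point stencil
            });
            std::swap(in, out);
        }
    }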
EDIT to address answers, questions & comments so far:
The example is just pseudocode to illustrate the idea. The actual implementation takes care of false sharing by padding the per-thread array partitions to CPU cache-line boundaries.
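What I mean by padding, roughly (illustrative sketch only, the struct name and CHUNK_SIZE are arbitrary):

    // Each thread's partition of the array is aligned to a 64-byte cache line,
    // so two threads never write into the same line.
    #include <cstddef>

    constexpr std::size_t CHUNK_SIZE = 1024;

    struct alignas(64) Chunk
    {
        double data[CHUNK_SIZE];
    };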
some_func(array_in, j) is a simple stencil that accesses the current point j and a small neighborhood around it, e.g. some_func(array, j) = array[j-1] + array[j] + array[j+1];
The answer given by Jérôme Richard touches on a very interesting point about barriers (which is IMO the root of the problem). It suggests to "replace barriers by local point-to-point neighbor synchronizations. Using task-based parallel runtimes can help to do that easily. Weaker synchronization patterns are the key". Interesting, but very general. How exactly would this be accomplished in this case? Does "point-to-point neighbor synchronization" involve an atomic primitive for every entry of the array?
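To make the question more concrete, here is a rough sketch of what I imagine such point-to-point neighbor synchronization could look like: one padded atomic progress counter per thread/chunk (not per array entry), double buffering instead of swap(), and every thread waits only for its two neighbor chunks instead of hitting a global barrier. All names, the chunk layout, and the yield-based waiting are my own guesses.

    #include <algorithm>
    #include <array>
    #include <atomic>
    #include <thread>
    #include <vector>

    constexpr int num_threads = 128;
    constexpr int num_iter    = 1000;
    constexpr int n           = 1 << 16;               // small enough to stay in L2/L3

    std::vector<double> buf_a(n, 1.0), buf_b(n, 0.0);  // double buffering instead of swap()

    struct alignas(64) Progress { std::atomic<int> done{0}; };  // padded, one per chunk
    std::array<Progress, num_threads> progress;

    void worker(int t, int begin, int end)             // [begin, end) = this thread's chunk
    {
        for (int k = 0; k < num_iter; ++k)
        {
            // Wait only for the two neighbors that own my halo cells,
            // instead of a barrier over all 128 threads.
            if (t > 0)
                while (progress[t - 1].done.load(std::memory_order_acquire) < k)
                    std::this_thread::yield();
            if (t < num_threads - 1)
                while (progress[t + 1].done.load(std::memory_order_acquire) < k)
                    std::this_thread::yield();

            const double* in  = (k % 2 == 0) ? buf_a.data() : buf_b.data();
            double*       out = (k % 2 == 0) ? buf_b.data() : buf_a.data();

            for (int j = std::max(begin, 1); j < std::min(end, n - 1); ++j)
                out[j] = in[j - 1] + in[j] + in[j + 1]; // the 3-point stencil

            // Publish "iteration k finished" so the two neighbors may proceed.
            progress[t].done.store(k + 1, std::memory_order_release);
        }
    }

    int main()
    {
        std::vector<std::thread> pool;
        const int chunk = n / num_threads;
        for (int t = 0; t < num_threads; ++t)
            pool.emplace_back(worker, t, t * chunk,
                              t == num_threads - 1 ? n : (t + 1) * chunk);
        for (auto& th : pool) th.join();
    }

Is one progress counter per chunk what is meant here, or does the suggested scheme require something finer-grained (e.g. per array entry)?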