
I want to use OpenMP to reduce a large array into a smaller, dynamically allocated array. For example, where

large[] = {1, 1, 1, 2, 2, 2, 3, 3, 3};

// OpenMP reduce length-3 sublists of large, to produce:

small[] = {3, 6, 9};

My requirements are similar to this question, but I have important additional constraints:

  • must support OpenMP 3.1 (so I cannot use OpenMP 4.5's array reductions, as this answer does)
  • cannot have a thread-private copy of small (like this answer), since small can be just as large as large, and hence thread-private copies may cause a stack overflow
  • large array must be iterated in order, for good caching (since it may be very large)

Some other details:

  • the summed elements of large are not necessarily adjacent
  • this reduction happens within a function. The small (output) array is pre-allocated by the user, and passed by reference. Ideally, the function should not allocate a new local copy of small
  • both small and large have lengths of a power of 2 (e.g. 2, 4, 8, 16 ...). Every element of small is reduced from the same number of large elements (another power of 2).

Here is example serial pseudocode, for clarity:

void myfunc(double* small, double* large, int lLen /*, other args */) {

    for (int i=0; i<lLen; i++) {

        int sInd = /* determined by other args */;

        small[sInd] += large[i];
    }
}

Here is an example implementation, disqualified because it uses OpenMP 4.5's array reduction. It also, undesirably, makes a local copy of small (as the reduction clause requires).

void myfunc(double* small, int sLen, double* large, int lLen /*, other args */) {

    double smallLocal[sLen];
    int i, sInd;

    #pragma omp parallel for private(i)
    for (i=0; i<sLen; i++)
        smallLocal[i] = 0;

    #pragma omp parallel for private(i, sInd) reduction(+:smallLocal)
    for (i=0; i<lLen; i++) {

        sInd = /* determined by other args */;

        smallLocal[sInd] += large[i];
    }

    #pragma omp parallel for private(i)
    for (i=0; i<sLen; i++)
        small[i] = smallLocal[i];
}

I'm hoping I can achieve the same thing in OpenMP 3.1, possibly avoiding even the local copy of smallLocal, by managing the array element reductions myself. How can I do this?

Anti Earth
  • Not knowing the exact equation for `SInd` makes things difficult. `SInd` could jump around, wreaking havoc on the cache and making `SIMD` optimizations difficult. – Craig Estey Jun 28 '21 at 16:56
  • @CraigEstey that's right. `sInd` is formed by a subset of the bits of `i`; the bit indices are supplied as an additional argument. The bit list is not necessarily contiguous nor sorted. It seems impossible to simultaneously cache-friendly read `large`, and cache-friendly write to `small`. Since `sLen <= lLen`, it seems wiser to optimise `large` reading – Anti Earth Jun 28 '21 at 17:01
  • Given that, it might be faster to do: `sInd = i >> bucket_shift` or `sInd = i % BUCKET_SIZE`, or some such cache friendly index, and then reorder according to the "real" `sInd` values after the main loop. – Craig Estey Jun 28 '21 at 17:15
  • Hmm is this possible, given the bit indices can be any subset/order? No matter how I reorder the buckets, adjacent `i` values may need to write to far-apart buckets (consider bit indices `{0, 1, 20}`) – Anti Earth Jun 28 '21 at 17:32
  • If `small` is small enough, random access may be okay because it will _still_ remain cache hot [if the array fits in the cache size your CPU has]. Based on your answer below (e.g. 32767) will fit inside most caches, so you may be okay. – Craig Estey Jun 28 '21 at 17:54
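For concreteness, here is a minimal sketch of the kind of index mapping described in the comments above, assuming a hypothetical bitInds argument listing which bits of i (least-significant output bit first) form sInd; the helper name and signature are illustrative, not from the post:

// Hypothetical helper: gather the bits of i named by bitInds[0..numBits-1]
// into a dense index into small
int getSmallIndex(long long i, const int* bitInds, int numBits) {
    int sInd = 0;
    for (int b = 0; b < numBits; b++)
        sInd |= (int)((i >> bitInds[b]) & 1) << b;
    return sInd;
}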

3 Answers

1

You could use OpenMP's atomic update construct, which is already available in OpenMP 3.1:

void myfunc(double* small, double* large, int lLen /*, other args */) {

    #pragma omp parallel for
    for (int i=0; i<lLen; i++) {

        int sInd = /* determined by other args */;
        #pragma omp atomic update
        small[sInd] += large[i];
    }
}

This should be faster than using locks.
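For a self-contained demonstration, here is a compilable variant of the above (build with e.g. gcc -fopenmp). The contiguous-block index mapping is only a stand-in assumption for the real sInd, chosen so the output reproduces the {3, 6, 9} example from the question:

#include <stdio.h>

void myfuncAtomic(double* small, int sLen, double* large, int lLen) {

    #pragma omp parallel for
    for (int i=0; i<lLen; i++) {

        int sInd = i / (lLen / sLen); // placeholder mapping, not the real one
        #pragma omp atomic update
        small[sInd] += large[i];
    }
}

int main(void) {
    double large[] = {1, 1, 1, 2, 2, 2, 3, 3, 3};
    double small[3] = {0};
    myfuncAtomic(small, 3, large, 9);
    for (int s=0; s<3; s++)
        printf("%g ", small[s]); // prints: 3 6 9
    printf("\n");
    return 0;
}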

paleonix
  • Wow, I didn't realise I could do an element-wise atomic update like that! Miraculously, this is only ~2.2x as slow as reduction for `sLen=8` (`lLen = 1048576`), and actually twice as fast for `sLen = 32768` than reduction! – Anti Earth Jun 28 '21 at 19:15
0

Here's a somewhat unsatisfying solution using locks. It requires a local array of locks as big as small, which may itself need to be malloc'd.

#ifdef _OPENMP
#include <omp.h>
#endif

void myfunc(double* small, int sLen, double* large, int lLen /*, other args */) {

    int i, sInd;

    // clear small
    #pragma omp parallel for private(i)
    for (i=0; i<sLen; i++)
        small[i] = 0;

    // initialise one lock per element of small
#ifdef _OPENMP
    omp_lock_t locks[sLen];
    #pragma omp parallel for private(i)
    for (i=0; i<sLen; i++)
        omp_init_lock(&locks[i]);
#endif

    // main loop
    #pragma omp parallel for private(i, sInd)
    for (i=0; i<lLen; i++) {

        sInd = /* determined by other args */;

        // update small with locks
#ifdef _OPENMP
        omp_set_lock(&locks[sInd]);
        small[sInd] += large[i];
        omp_unset_lock(&locks[sInd]);
#else
        small[sInd] += large[i];
#endif
    }

    // destroy locks
#ifdef _OPENMP
    #pragma omp parallel for private(i)
    for (i=0; i<sLen; i++)
        omp_destroy_lock(&locks[i]);
#endif
}

As expected, this was slower than using OpenMP 4.5's array reduction:

  • for sLen = 8, lLen = 1048576, using locks was 26x slower
  • for sLen = 32768, lLen = 1048576, using locks was 1.5x slower

It should be mentioned that OpenMP 4.5's array reduction is only compatible with local stack arrays, and not with dynamic memory. This limits the size of the reduced array to ~1 MiB, or 2^17 doubles (sLen = 131072). In contrast, this locks solution can handle any size, since the locks can be malloc'd.
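For reference, OpenMP 4.5 also accepts array sections over pointer arguments in a reduction clause, sketched below; this remains off-limits under the OpenMP 3.1 constraint, and each thread still receives a private copy of the section, consistent with the size limit described above:

// OpenMP 4.5 array-section reduction over a pointer (for reference only;
// not valid under OpenMP 3.1). Each thread privatizes small[0:sLen],
// so a very large sLen can still exhaust thread stacks.
void myfunc45(double* small, int sLen, double* large, int lLen) {

    #pragma omp parallel for reduction(+:small[0:sLen])
    for (int i=0; i<lLen; i++) {
        int sInd = i / (lLen / sLen); // placeholder mapping
        small[sInd] += large[i];
    }
}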

Anti Earth
  • You might be better off using `stdatomic.h` primitives instead of locks. (e.g.) `atomic_fetch_add(&small[sInd],large[i]);` – Craig Estey Jun 28 '21 at 16:59
  • @CraigEstey do you have a recommended resource for learning about these atomic functions? My beloved guide doesn't include them (https://bisqwit.iki.fi/story/howto/openmp) – Anti Earth Jun 28 '21 at 17:28
  • That's a tall order. I've looked for a detailed tutorial in the past. Most of the links from a web search point to the `c++` versions and they are _not_ so useful for `c`. I've just used trial-and-error mostly. But, see: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html and https://www.unix.com/man-page/freebsd/3/atomic_fetch_add/ _Hint:_ Any other primitive can be implemented using just `atomic_compare_exchange_*` [checking the return value and looping on false] – Craig Estey Jun 28 '21 at 17:46
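Following the compare-exchange hint in the last comment, here is a minimal sketch of an atomic double-precision add built from a CAS loop (C11's atomic_fetch_add is only defined for integer and pointer types). It assumes the target elements are declared _Atomic double, and the helper name is illustrative:

#include <stdatomic.h>

// Atomically perform *target += value by looping on compare-exchange:
// on failure, 'expected' is refreshed with the current value and we retry
void atomicAddDouble(_Atomic double* target, double value) {
    double expected = atomic_load(target);
    while (!atomic_compare_exchange_weak(target, &expected, expected + value))
        ;
}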
-1

You could do this sort of thing, which is common in scientific codes (note that it assumes the elements summed into each small[i] are adjacent in large):

#pragma omp parallel for
for(int i = 0; i < N/3; i++)
{
    // each small[i] is written by exactly one outer iteration,
    // so no reduction clause or synchronization is needed
    for (int j = 0; j < 3; j++)
    {
        small[i] += large[3*i + j];
    }
}
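A slightly generalized sketch of the same pattern, assuming (per the question) that sLen divides lLen and that each small[i] sums a contiguous run of lLen/sLen elements; the contiguity assumption is this pattern's limitation, since the question states the summed elements are not necessarily adjacent:

// Contiguous-bucket reduction: lLen/sLen adjacent elements of large
// are summed into each small[i]
void reduceContiguous(double* small, int sLen, double* large, int lLen) {
    int bucket = lLen / sLen;
    #pragma omp parallel for
    for (int i = 0; i < sLen; i++) {
        double sum = 0;
        for (int j = 0; j < bucket; j++)
            sum += large[i * bucket + j];
        small[i] = sum; // single writer per element: race-free
    }
}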