OpenMP atomic _mm_add_pd

Question

I'm trying to use OpenMP for parallelization of an already vectorized code with intrinsics, but the problem is that I'm using one XMM register as an outside 'variable' that I increment each loop. For now I'm using the shared clause

__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];

#pragma omp parallel for shared(xmm0)
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ... result of some operations

    xmm0 = _mm_add_pd(xmm0, xmm7);
}

_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];

because the atomic operation is not supported (in VS2010)

__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];

#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ... result of some operations

    #pragma omp atomic
    xmm0 = _mm_add_pd(xmm0, xmm7);
}

_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];

Does anyone know a clever work-around?

EDIT: I've also tried it using the Parallel Patterns Library just now:

__declspec(align(16)) double res[2];
combinable<__m128d> xmm0_comb([](){return _mm_setzero_pd();});

parallel_for(0, len, 1, [&xmm0_comb, ...](int i)
{
    __m128d xmm7 = ... result of some operations

    __m128d& xmm0 = xmm0_comb.local();
    xmm0 = _mm_add_pd(xmm0, xmm7);
});

__m128d xmm0 = xmm0_comb.combine([](__m128d a, __m128d b){return _mm_add_pd(a, b);});
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];

but the result was disappointing.

I don't think there are any atomic SSE arithmetic instructions. So I think the only way to get what you want is to wrap it in a critical section. Bad for performance? Yes. You're better off using a proper reduction algorithm. — Mysticial, Dec 23 '12 at 21:50
@Mysticial: What do you mean by a proper reduction algorithm? — demorge, Dec 23 '12 at 21:56
A parallel reduction. You're essentially asking an XY question. You want to solve your problem using atomic adds. The correct solution is to use a reduction. OpenMP supports it with the `reduction()` directive. You won't be able to do it directly on SSE operands, but you can reduce it to `double` first. — Mysticial, Dec 23 '12 at 21:59
By the way, one of the reasons why VS does not support `atomic` on that statement is that it only implements the 10-years old OpenMP 2.0 standard, which only allows for statements like `x binop= y`, `x++ / ++x`, and `x-- / --x` to appear in the `atomic` construct. — Hristo Iliev, Dec 25 '12 at 23:13

score 4 · Answer 1 · answered Dec 23 '12 at 22:17

4

You're solving the problem the wrong way. You should be using a reduction instead of atomic operations:

This is a better approach:

double sum = 0;

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < len; i++)
{
    __m128d xmm7;// = ... result of some operations

    //  Collapse to a "double".
    _declspec(align(16)) double res[2];
    _mm_store_pd(res, xmm7);

    //  Add to reduction variable.
    sum += res[0] + res[1];
}

double final_result = sum;

A reduction is essentially an operation that collapses "reduces" everything to a single variable using an associative operation such as +.

If you're doing a reduction, always try to use an actual reduction approach. Don't try to cheat it with atomic operations or critical sections.

The reason for this is that atomic/critical section approaches are inherently not scalable as they maintain a long critical path data-dependency. A proper reduction approach reduces this critical path to log(# of threads).

The only downside of course is that it breaks floating-point associativity. If that matters, then you're basically stuck with sequentially summing up each iteration.

answered Dec 23 '12 at 22:17

Mysticial

464,885
45
335
332

Great, it's already faster than my previous approach, but I'm not perfectly happy with the collapsing inside the loop. I'm going to try out some other things as well :) – demorge Dec 23 '12 at 22:24
There's a bunch of ways to do this. I only collapsed it inside the loop because, AFAIK, OpenMP reduction only supports primitive types. Though I admit that I haven't tried it with operator overloading on classes. – Mysticial Dec 23 '12 at 22:26
I wish I had read this answer before I had asked this question http://stackoverflow.com/questions/16551307/atomic-operators-sse-avx-and-openmp Do you have any references that discuss how a "proper reduction" is done? – May 22 '13 at 12:14
@raxman I don't have any off the top of my head, but idea is pretty simple. A linear reduction (add them one-by-one) has a `O(n)` critical path. A tree reduction, breaks up the set of elements into subsets and sums them independently. Then the results are summed up. When done recursively, it has a `O(log(n))` critical path. – Mysticial May 22 '13 at 16:50
@Mysticial, thanks, I see now my assumption about a reduction in OpenMP using atomic is probably not true. I should stick with using the OpenMP reduction (and not trying to cheat) whenever possible. – May 22 '13 at 18:08

score 2 · Answer 2 · answered Dec 23 '12 at 22:29

What you're looking for is a reduction. You can do that as an omp reduction if your compiler supports it (gcc does), or you can roll one yourself by summing into a private xmm for each thread. Below is a simple version doing both:

#include <emmintrin.h>
#include <omp.h>
#include <stdio.h>


int main(int argc, char **argv) {

    const int NTHREADS=8;
    const int len=100;

    __m128d xmm0[NTHREADS];
    __m128d xmmreduction = _mm_setzero_pd();
    #pragma omp parallel for num_threads(NTHREADS)
    for (int i=0; i<NTHREADS; i++)
        xmm0[i]= _mm_setzero_pd();

    __attribute((aligned(16))) double res[2];

    #pragma omp parallel num_threads(NTHREADS) reduction(+:xmmreduction)
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < len; i++)
        {
            double d = (double)i;
            __m128d xmm7 = _mm_set_pd( d, 2.*d );

            xmm0[tid] = _mm_add_pd(xmm0[tid], xmm7);
            xmmreduction = _mm_add_pd(xmmreduction, xmm7);
        }
    }

    for (int i=1; i<NTHREADS; i++)
        xmm0[0] = _mm_add_pd(xmm0[0], xmm0[i]);

    _mm_store_pd(res, xmm0[0]);
    double final_result = res[0] + res[1];

    printf("Expected result   = %f\n", 3.0*(len-1)*(len)/2);
    printf("Calculated result = %lf\n", final_result);

    _mm_store_pd(res, xmmreduction);
    final_result = res[0] + res[1];

    printf("Calculated result (reduction) = %lf\n", final_result);

    return 0;
}

+1, Though I just wanted to confirm that VC++ does not support OpenMP reductions on "non-scalar" types. If GCC does, then that would make it a useful a GCC extension. — Mysticial, Dec 23 '12 at 22:34

score 2 · Accepted Answer · answered Dec 23 '12 at 23:01

With great help from the people who answered my question I've come up with this:

double final_result = 0.0;

#pragma omp parallel reduction(+:final_result)
{
    __declspec(align(16)) double r[2];
    __m128d xmm0 = _mm_setzero_pd();

    #pragma omp for
    for (int i = 0; i < len; i++)
    {
        __m128d xmm7 = ... result of some operations

        xmm0 = _mm_add_pd(xmm0, xmm7);
    }
    _mm_store_pd(r, xmm0);
    final_result += r[0] + r[1];
}

It basically separates the collapse and reduction, performs very well.

Many thanks to all who have helped me!

One small comment. I think you can make that code a bit more efficient by removing the barrier on the for loop with "pragma omp for nowait" — , May 23 '13 at 10:49

score 0 · Answer 4 · answered Dec 23 '12 at 21:45

0

I guess you can't add your own intrinsics to the compiler, and MS compilers decided to skip inline assembler. Not sure there is an easy solution at all.

answered Dec 23 '12 at 21:45

Mats Petersson

126,704
14
140
227

OpenMP atomic _mm_add_pd

4 Answers4

Linked