Atomic operators, SSE/AVX, and OpenMP

Question

I'm wondering whether SSE/AVX operations such as addition and multiplication can be atomic. I ask because the OpenMP atomic construct only works on a limited set of operators; it does not work on, for example, SSE/AVX additions.

Let's assume I had a datatype float4 that corresponds to an SSE register and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could do a reduction over an array with the following code:

float4 sum4 = 0.0f; //sets all four values to zero
#pragma omp parallel
{
    float4 sum_private = 0.0f;
    #pragma omp for nowait
    for(int i=0; i<N; i+=4) {
        float4 val = float4().load(&array[i]); //load four floats into an SSE register
        sum_private += val; //sum_private = _mm_add_ps(val,sum_private)
    }
    #pragma omp critical
    sum4 += sum_private;
}
float sum = horizontal_sum(sum4); //sum4[0] + sum4[1] + sum4[2] + sum4[3]

But atomic is generally faster than critical, and my instinct tells me SSE/AVX operations should be atomic (even if OpenMP does not support them). Is this a limitation of OpenMP? Could I use, for example, Intel Threading Building Blocks or pthreads to do this as an atomic operation?

Edit: Based on Jim Cownie's comment I created a new function which is the best solution. I verified that it gives the correct result.

float sum = 0.0f;
#pragma omp parallel reduction(+:sum)
{
    Vec4f sum4 = 0.0f;  
    #pragma omp for nowait
    for(int i=0; i<N; i+=4) {
        Vec4f val = Vec4f().load(&A[i]); //load four floats into an SSE register
        sum4 += val; //sum4 = _mm_add_ps(val,sum4)
    }
    sum += horizontal_add(sum4);
}

Edit: Based on comments by Jim Cownie and by Mystical in the thread "OpenMP atomic _mm_add_pd", I now realize that the reduction implementation in OpenMP does not necessarily use atomic operators, and that it's best to rely on OpenMP's reduction implementation rather than trying to do it with atomic.
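For reference, here is a minimal self-contained sketch of the same reduction pattern written with raw SSE intrinsics instead of the Vec4f class. The names simd_sum and horizontal_add are placeholders chosen for this example, and it assumes an x86 target and an array length that is a multiple of 4. Each thread accumulates into its own private register, reduces it to a scalar, and only then hands the scalar to OpenMP's reduction clause, so no explicit atomic or critical is needed.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Sum the four lanes of an SSE register into one float.
// (Placeholder helper for this sketch.)
static float horizontal_add(__m128 v) {
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, v);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

// Sum n floats; this sketch assumes n is a multiple of 4.
float simd_sum(const float* a, int n) {
    assert(n % 4 == 0);
    float sum = 0.0f;
    #pragma omp parallel reduction(+:sum)
    {
        __m128 sum4 = _mm_setzero_ps();  // private to each thread
        #pragma omp for nowait
        for (int i = 0; i < n; i += 4) {
            // unaligned load, so the array needs no special alignment
            sum4 = _mm_add_ps(sum4, _mm_loadu_ps(&a[i]));
        }
        // one scalar per thread; OpenMP combines these at the end of the region
        sum += horizontal_add(sum4);
    }
    return sum;
}
```

Compiled without OpenMP enabled, the pragmas are ignored and the function still returns the same sum serially. Because float addition is not associative, results for general data may differ slightly between thread counts.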

Relevant: http://stackoverflow.com/questions/7646018/sse-instructions-single-memory-access — Mikhail, May 14 '13 at 19:33
Thanks, that's a very interesting discussion. It appears SSE reads/writes may or may not be atomic. Based on that discussion I think AVX, at least pre-Haswell, is not necessarily atomic: although it can do two 128-bit reads in one cycle (but maybe not necessarily atomically), it can only write 128 bits in one cycle. — , May 14 '13 at 20:37
Why don't you use an OpenMP reduction? You could do the horizontal reduction inside the thread (where you have the omp critical), and then reduce over sum. That way you don't need any critical or atomic operation... — Jim Cownie, May 16 '13 at 08:16
@JimCownie, maybe I misunderstand what you mean but reductions can only be done on a limited set of operators on POD. SSE/AVX is not one of the operators. Additionally, reductions in OpenMP use atomic. — , May 16 '13 at 10:29
I guess what Jim Cownie meant was to do the horizontal sum over the private vectors and to use `reduction(+:sum)` to get the final sum. This way you won't need any _explicit_ `atomic` constructs. — Hristo Iliev, May 16 '13 at 12:39
@JimCownie. You are correct. Thank you!!! That's a much better solution! I added the new code to reflect what I think you meant. And thank you Hristo for prodding me to think about Jim's comment more carefully. — , May 16 '13 at 13:51
"reductions in OpenMP use atomic." They certainly do not have to, the reduction happens at a barrier, so you can reduce up a tree where you know which thread should be performing the operation, and there is no need for any atomic operations. — Jim Cownie, May 17 '13 at 12:36
@JimCownie, I learned that OpenMP uses atomic for reductions from the following link http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause I guess he's wrong (though he does say it is for illustration only). In my limited experience, using atomic explicitly for a reduction did not make the performance worse. In any case I got this code working with a reduction only (see what I added) and that's the code I will use. — , May 17 '13 at 16:31
@JimCownie, you're right again, I was wrong about atomic being necessary in a reduction in OpenMP. I found some good comments on this at this thread. http://stackoverflow.com/questions/14014874/openmp-atomic-mm-add-pd/14015141?noredirect=1#comment24032403_14015141 — , May 22 '13 at 18:14
@JimCownie, if you want the points, you can write your comments as an answer and I'll accept it. — , May 23 '13 at 07:51

Rick · Answer 1 · 2013-05-16T02:30:44.877

3

SSE & AVX in general are not atomic operations (but multiword CAS would sure be sweet).

You can use the combinable class template in tbb or ppl for more general-purpose reductions and thread-local initializations. Think of it as a synchronized hash table indexed by thread id. It works just fine with OpenMP and doesn't spin up any extra threads on its own.

You can find examples on the tbb site and on msdn.

Regarding the comment, consider this code:

x = x + 5

You should really think of it as the following particularly when multiple threads are involved:

while( true ){
    oldValue = x
    desiredValue = oldValue + 5
    //this conditional plus the assignment is the atomic compare and swap
    if( x == oldValue ){
       x = desiredValue
       break;
    }
    //otherwise another thread changed x in the meantime, so retry
}

Make sense?
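The retry loop above maps fairly directly onto std::atomic in C++11. The following is only a sketch under stated assumptions: cas_add is a made-up helper name, and it uses a plain int rather than a 128-bit SSE value. For integers, fetch_add already does this in one call; the explicit loop is shown only to make the compare-and-swap step visible.

```cpp
#include <atomic>

// Atomically add `delta` to `x` with a compare-and-swap retry loop.
// Returns the value this call stored. (Hypothetical helper for illustration.)
int cas_add(std::atomic<int>& x, int delta) {
    int old_value = x.load();
    // compare_exchange_weak stores old_value + delta only if x still equals
    // old_value. On failure it writes the current value of x into old_value,
    // so the next iteration recomputes the desired value from fresh data.
    while (!x.compare_exchange_weak(old_value, old_value + delta)) {
        // another thread changed x (or the weak CAS failed spuriously); retry
    }
    return old_value + delta;
}
```

Calling cas_add(x, 5) then plays the role of x = x + 5 from the answer, but stays correct when several threads update x concurrently.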

edited May 16 '13 at 02:30

answered May 14 '13 at 22:55


Thanks, my instinct was wrong then. I'm a bit surprised that SSE is not atomic since e.g. Sandy Bridge can read/write a 128 bit word in one clock cycle (assuming the word is 16 byte aligned). Shouldn't that be atomic? What's special about 64bit or 32bit words that makes them atomic? – May 15 '13 at 18:07
Thank you for the additional comment. I think I more or less understand the code. That's just the Copy and Swap (CAS) algorithm as explained on Wikipedia. But it's not clear to me why a 32-bit word is atomic and a 128-bit word is not. (I am very new to most of these multi-threading concepts.) Is it because there is no way to do a 128-bit equality test with SSE in one instruction? – May 16 '13 at 11:52
Oops, hehe, I mean Compare and Swap. I confused the name with the Copy and Swap idiom. – May 16 '13 at 11:57
