
I would like to ask you if there is a better way to combine two atomics.

My goal is to find the highest result for each of K equations (more than 32) under a list of J parameter values (very similar to a 2-way input), and to save both the value and the j index.

    if (atomicMax(&max_k[id], t_max) < t_max) atomicExch(&indexMax[id], t_pos);

Initially we used the approach described above, but since any thread may later see an even higher value, it is possible to have in the same warp threads where B > C > A (thread B has the highest value, and thread C's value is higher than A's). I'm not sure, but the atomicExch calls can execute in a different thread order than the atomicMax calls did (is that correct?). So we tried a critical section, but it led to a deadlock. In the end, the solution below seems to work.

Is there a better way, or is there any issue in the following code?

    __device__ int atomicMaxCAS(int* addressMax, int valMax, int* addressCAS, int valCas) {
        int oldCas = *addressCAS, assumedCas;
        int oldMax = *addressMax, assumedMax;
        do {
            assumedCas = oldCas;
            assumedMax = oldMax;
            oldMax = atomicMax(addressMax, valMax);
            if (oldMax < valMax) oldCas = atomicCAS(addressCAS, assumedCas, valCas);
        } while (assumedCas != oldCas || assumedMax != oldMax);
        return oldMax;
    }

Thanks in advance! I was able to start writing CUDA thanks to all the posts here!

Thiago Conrado
    You can't use two atomics like that and expect coherent results. You have set up a possible race condition. Suppose thread A does the `atomicMax` and replaces the old value with 100. Then thread B does the `atomicMax` and replaces the 100 value with 110. Then suppose thread B does the `atomicCAS`, and replaces its index. Then thread A does the `atomicCAS`, and replaces thread B index with thread A index. You now have a max value of 110 with an index corresponding to thread A. Even within a single warp, there is no stated order of execution of atomic operations. – Robert Crovella Aug 16 '16 at 19:49
    since your values are both 32-bit quantities, you might be interested in using a custom 64-bit atomic operation [like this](http://stackoverflow.com/questions/17411493/custom-atomic-functions) to update a value and an index at the same time, atomically. – Robert Crovella Aug 16 '16 at 19:51
  • On top of those correctness issues, a reduction is likely to be faster because the atomic operation fully serialises all threads. – tera Aug 16 '16 at 20:12
  • @RobertCrovella: Thanks a lot! I believe that it will solve! – Thiago Conrado Aug 16 '16 at 20:15

1 Answer


> there is any issue in the following code?

Yes, you can't use two atomics like that and expect coherent results. You have set up a possible race condition.

Suppose thread A does the atomicMax and replaces the old value with 100. Then thread B does the atomicMax and replaces the 100 value with 110. Then suppose thread B does the atomicCAS, and replaces its index. Then thread A does the atomicCAS, and replaces thread B index with thread A index. You now have a max value of 110 with an index corresponding to thread A.

Even within a single warp, there is no stated order of execution of atomic operations.

> Is there a better way?

  1. since your values are both 32-bit quantities, you might be interested in using a custom 64-bit atomic operation like this to update a value and an index at the same time, atomically.

  2. For large-scale usage (lots of threads) you may want to explore a classical parallel reduction. There are questions here on the CUDA tag such as this one and this one that discuss how to do an index+value reduction.
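
For illustration, option 1 might be sketched as follows. This is an assumption-laden sketch, not the linked code: it assumes the value is a non-negative `int` (so packed words order the same way as the values themselves) and relies on the 64-bit `atomicMax` overload, which requires compute capability 3.5 or higher. The function name is made up for this example.

```cuda
// Sketch: pack (value, index) into one 64-bit word so a single 64-bit
// atomicMax updates both atomically. Assumes value >= 0, so comparing the
// packed words compares the values first (index only breaks ties).
__device__ void atomicMaxValIdx(unsigned long long *packedMax, int value, int index)
{
    unsigned long long packed =
        ((unsigned long long)(unsigned)value << 32) | (unsigned)index;
    atomicMax(packedMax, packed);   // 64-bit atomicMax: requires sm_35+
}

// Unpacking afterwards:
//   int value = (int)(packed >> 32);
//   int index = (int)(packed & 0xFFFFFFFFu);
```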

Global atomics on Kepler are pretty fast, so depending on your exact code and the reduction "density", a global atomic reduction might not be a big problem performance-wise.
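
As a minimal sketch of option 2 at warp level, a shuffle-based max reduction can carry the index along with the value, so no second atomic is needed inside the warp. This uses the `__shfl_down_sync` intrinsic (the CUDA 9+ form of `__shfl_down`); the function name is illustrative only:

```cuda
// Sketch: warp-level max reduction that keeps value and index paired.
// After the loop, lane 0 holds the warp's maximum value and its index;
// one atomic (or a block-level reduction) per warp then combines partials.
__device__ void warpReduceMaxValIdx(int &value, int &index)
{
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        int otherValue = __shfl_down_sync(0xffffffff, value, offset);
        int otherIndex = __shfl_down_sync(0xffffffff, index, offset);
        if (otherValue > value) {   // keep the larger value and its index together
            value = otherValue;
            index = otherIndex;
        }
    }
}
```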

Robert Crovella
  • Option 1 worked like a charm. I was aware of unions and 64-bit CAS, but it never crossed my mind to use both as an indexed float value. Indeed it simplified the whole project, because it had been using one address for the index and another for the value. After the 64-bit CAS, I do the XOR shuffle reduction, since the data allow it, and the code got very fast. I just want to say many thanks, because I started CUDA 2 months ago and now we have a kernel that is about 10,000x faster than the serial code. A lot of that I learned from your posts. Great job! I hope you guys carry on for a long time! – Thiago Conrado Aug 17 '16 at 18:15