I'm implementing an algorithm in CUDA that needs to perform the following steps:

Given an array x (in shared mem) and some device function f,

  • Select a pair of indices (i,j) into x (randomly).
  • Calculate y = f(x[i], x[i - 1], x[j], x[j + 1]).
  • Based on y decide whether to exchange the positions of x[i] and x[j].
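
In serial form, one iteration would look roughly like this (pick_pair, f and accept are placeholders, and I'm writing x as float just for concreteness):

int i, j;
pick_pair(&i, &j);                           // random pair, chosen so that
                                             // x[i-1] and x[j+1] stay in range
float y = f(x[i], x[i - 1], x[j], x[j + 1]);
if (accept(y))                               // whatever the acceptance test is
{
    float tmp = x[i];                        // swap x[i] and x[j]
    x[i] = x[j];
    x[j] = tmp;
}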

The problem is that the function f depends on 4 values in shared memory, all of which have to be guaranteed to remain unchanged until after the swap.

For a minute I figured this could be the body of a critical section, but I don't see how I could use a single lock-address to lock 4 variables. The main problem, I think, is that when some thread is working on (i,j), other threads are not allowed to work on any pair (k,l) where k or l are any of {i, i-1, j, j+1}.

EDIT

Right after posting, an idea came to mind... Would it be possible to have a cascade of locks? First lock x[i], if that succeeds lock x[i-1], etc for all 4 values. Only if the final lock succeeds, process the above mentioned steps. I'll go experiment and keep this question open to other suggestions.
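
Something like this is what I have in mind (untested sketch; locks would be a per-element lock array in shared memory, declared alongside x in the kernel, and the four indices sorted and assumed distinct so that every thread acquires them in the same order):

// locks[] has one int per element of x: 0 = free, 1 = taken.
__device__ bool try_lock_all(int *locks, const int idx[4])
{
    // idx holds the four indices sorted ascending (assumed distinct here),
    // so all threads acquire in the same global order.
    for (int k = 0; k < 4; ++k)
    {
        if (atomicCAS(&locks[idx[k]], 0, 1) != 0)
        {
            // Failed: release whatever we already hold and give up.
            for (int m = k - 1; m >= 0; --m)
                atomicExch(&locks[idx[m]], 0);
            return false;
        }
    }
    return true;
}

__device__ void unlock_all(int *locks, const int idx[4])
{
    for (int k = 3; k >= 0; --k)
        atomicExch(&locks[idx[k]], 0);
}

If try_lock_all fails, the thread could either retry with a new random pair or simply skip this attempt.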

  • The idea in your edit would work, **if** you are careful about the order in which you take the locks (basically, the same order across all threads — e.g. ordered by the index you are using) and make sure to release them if you fail to acquire all of them. Otherwise you invite deadlocks. However, this is far from ideal; I would advise you to step back to the drawing board and see if you can avoid the need for this operation (or at least the need to have it so finely synchronized). –  Aug 06 '16 at 11:08
  • @delnan I have been back to the drawing board several times now because of this issue, but I can't find another way. These are simply the operations that have to be executed. It also seems quite suitable for parallelization since x is typically large and the odds for collisions are reasonably small. What is this "&mdash" you are speaking of? – JorenHeit Aug 06 '16 at 11:12
  • Oops, "—" would have been the fancy variant of "---" if Stack Overflow interpreted HTML in comments =/ –  Aug 06 '16 at 11:22
  • @delnan Haha whoops, should have caught that. – JorenHeit Aug 06 '16 at 11:38
  • Not knowing exactly what your ambition is or how "randomly" you need the indices to be, but I was wondering if you could divide the problem up into blocks that would allow "random" choice but would not overlap. Perhaps range limiting the random number via modulo of some sort. – William Jones Aug 06 '16 at 14:12
  • @WilliamJones It's a stochastic optimization problem, so the algorithm depends on the choice of (i,j) to be as random as possible. Non-overlapping segments of memory would completely defeat the purpose. – JorenHeit Aug 06 '16 at 14:18
  • I think a multi-lock approach would be difficult to set up and prone to deadlock and other errors. Within a threadblock I would consider a warp-specialization approach. Have a work queue (of read/write requests) and have a single warp process those requests. Other warps can issue work into the queue. To extend this across an entire grid, you could go to block-specialization, and select the first block loaded as the master block/work processor. This allows for coherent serialization of access and no requirement to use locks. It's similar to the answer given by @talonmies, expanding on it. – Robert Crovella Aug 06 '16 at 14:37

2 Answers


CUDA is very lock-unfriendly and critical-section-unfriendly :) One of many reasons is that it operates in 32-wide SIMD mode, which can cause unexpected deadlocks. Consider for example:

__shared__ int crit;
crit = 0;                          // every thread writes 0; fine for initialization
__syncthreads();
int old;
do {
    old = atomicCAS(&crit, 0, 1);  // try to take the lock
} while (old == 1);                // spin until we got it
// critical section
crit = 0;                          // release the lock

The intention is that threads actively wait in the do-while loop. Only one thread exits the loop at a time, performs its action in the critical section, and then resets crit to 0. However, in CUDA the warp scheduler will always give priority to the 31 threads spinning in the loop over the 1 thread that exited. Because warps operate in SIMD, the thread in the critical section never executes and you get an unexpected deadlock.
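
For completeness: the usual way this particular hang is worked around is to put the critical section inside the branch that won the CAS, so the owning thread makes progress before the warp reconverges (untested sketch):

__shared__ int crit;
if (threadIdx.x == 0) crit = 0;
__syncthreads();
bool done = false;
while (!done)
{
    if (atomicCAS(&crit, 0, 1) == 0)   // did we get the lock?
    {
        // critical section, executed inside the winning branch
        __threadfence_block();         // make the writes visible to the block
        atomicExch(&crit, 0);          // release the lock
        done = true;
    }
}

Even then, every acquisition serialises the warp, so the lock only moves the cost around.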


For that reason I would strongly suggest trying to avoid critical sections completely.

Now, I don't know the details of your algorithm. I assume that you have some "master" for/while loop and in each iteration you pick a random pair for a possible swap.

You say collisions don't happen often. When one does occur, could you just drop one of the conflicting pairs completely, instead of waiting for it to succeed?

If that is something you would accept, then just detecting the collision would be the problem, not the action that you take afterwards. To detect collisions you could for example:

  • After each thread comes up with a pair candidate, sort the pair indices and then check the values held by the neighbours.

  • Have a flag array f of the same size as x and atomicCAS on it 4 times, similarly to what you suggested (see the sketch below). If f is in shared memory, it should not be costly.

Now, when a thread sees it is in conflict, it does nothing. It just waits for all other threads to complete their work, hits the __syncthreads, and then goes on to the next iteration of the master for/while loop.

The difference from your proposed solution is that if a thread fails to get the lock, it just drops its work instead of waiting.
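
A sketch of the flag-array variant (untested; flags is what I called f above, renamed to avoid clashing with your function f, should_swap stands in for your acceptance criterion, and I'm assuming x holds floats):

// flags[] is a per-element "claimed" array in shared memory, same size as x,
// zeroed (cooperatively) before each iteration of the master loop.
int idx[4] = { i - 1, i, j, j + 1 };        // de-duplicate first if they can coincide
bool mine = true;
for (int k = 0; k < 4 && mine; ++k)
    if (atomicCAS(&flags[idx[k]], 0, 1) != 0)
        mine = false;                       // collision: drop this pair

if (mine)
{
    float y = f(x[i], x[i - 1], x[j], x[j + 1]);
    if (should_swap(y))
    {
        float tmp = x[i];
        x[i] = x[j];
        x[j] = tmp;
    }
}
__syncthreads();                            // everyone has finished this iteration
// clear flags[] here before the next iteration

A thread that drops partway leaves its earlier claims set until that reset, which only costs a missed opportunity, not correctness.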

CygnusX1
  • Thanks! Very informative and helpful. I hadn't thought of simply doing nothing. That will work just fine I guess... – JorenHeit Aug 07 '16 at 07:58

It seems to me you are hugely overthinking this. If all the memory transactions must be serialised for the operation to be thread safe, then the simplest solution is to have one thread per block perform the operation. So something like

if (threadIdx.x == 0) // assume 1D block for simplicity
{
    y = f(x[i], x[i - 1], x[j], x[j + 1]);
    compare_and_swap(y, x[i], x[j]);
}
__syncthreads();

will work fine because the array being operated on is in shared memory, so a guaranteed single thread per block is performing the operation, and there are no read-after-write hazards. In practice, this approach shouldn't be slower than having a whole block of threads in contention for a lock, or a large number of serialised atomic memory transactions.

talonmies
  • Hmm, the problem has already been divided into many blocks, each of which share a piece of memory on which these operations have to be done. As I mentioned in one of the comments under the question, there won't be much contention for locks because the array `x` will typically be large compared to the number of threads operating on it. – JorenHeit Aug 06 '16 at 14:17
  • It might not have been clear that this operation has to be repeated many many many times, which is why I need it to run in parallel within each block. – JorenHeit Aug 06 '16 at 14:31
  • @JorenHeit: None of this makes much sense. If you need locks or critical sections, then you *are not* running the code in parallel, you are serialising execution. So the question, it seems to me, should be "what is the lightest weight, most reliable way to serialise this operation", and the answer is to have one thread at whatever granularity your algorithm can tolerate for *correctness*. Depending on your use case, that will be one thread per block, or perhaps one thread per warp. If you want more parallelism, run many small blocks. – talonmies Aug 06 '16 at 14:35
  • If the computational complexity of `y=f(...)` is relatively small, then I agree with @talonmies. You will be serializing access anyway (no matter what you do -- as it is required for correctness). If the computational complexity of `y` is huge, then it may benefit to decouple the serialized reading/writing activity from the `y` processing activity, and warp specialization with a work queue could accomplish this. – Robert Crovella Aug 06 '16 at 14:44
  • @talonmies I guess I don't understand. How is this serializing when 99% of the time a lock can be acquired without conflict? – JorenHeit Aug 06 '16 at 15:10
  • This is CUDA we are talking about. Every time a thread in a warp gets the lock, the other 31 are serialised. And if the lock has block-wide scope, then the other warps must be stalled as well. There is no such thing as an atomic read in CUDA, so I don't see any other way to do what you have described. – talonmies Aug 06 '16 at 16:25
  • @talonmies Thanks for the info. Learning a lot here... Is this even true when all threads in a warp acquire different locks? Every element in the shared memory will have its own unique corresponding lock. – JorenHeit Aug 06 '16 at 17:21