It looks to me like what you want is a "critical section" in your code. A critical section allows one thread to execute a sequence of instructions while preventing any other thread or threadblock from executing those instructions.
A critical section can be used to control access to a memory area, for example, so as to allow un-conflicted access to that area by a single thread.
Atomics by themselves can only be used for a very limited, basically single operation, on a single variable. But atomics can be used to build a critical section.
You should use the following code in your kernel to control thread access to a critical section:
__syncthreads();
if (threadIdx.x == 0)
acquire_semaphore(&sem);
__syncthreads();
//begin critical section
// ... your critical section code goes here
//end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if (threadIdx.x == 0)
release_semaphore(&sem);
__syncthreads();
Prior to the kernel define these helper functions and device variable:
__device__ volatile int sem = 0;
__device__ void acquire_semaphore(volatile int *lock){
while (atomicCAS((int *)lock, 0, 1) != 0);
}
__device__ void release_semaphore(volatile int *lock){
*lock = 0;
__threadfence();
}
I have tested and used successfully the above code. Note that it essentially arbitrates between threadblocks using thread 0 in each threadblock as a requestor. You should further condition (e.g. if (threadIdx.x < ...)
) your critical section code if you want only one thread in the winning threadblock to execute the critical section code.
Having multiple threads within a warp arbitrate for a semaphore presents additional complexities, so I don't recommend that approach. Instead, have each threadblock arbitrate as I have shown here, and then control your behavior within the winning threadblock using ordinary threadblock communication/synchronization methods (e.g. __syncthreads()
, shared memory, etc.)
Note that this methodology will be costly to performance. You should only use critical sections when you cannot figure out how to otherwise parallelize your algorithm.
Finally, a word of warning. As in any threaded parallel architecture, improper use of critical sections can lead to deadlock. In particular, making assumptions about order of execution of threadblocks and/or warps within a threadblock is a flawed approach.
Here is an example of usage of binary_semaphore
to implement a single device global "lock" that could be used for access control to a critical section.