Cuda atomics change flag

Question

I have a piece of serial code which does something like this

if( ! variable )
{
  do some initialization here 
  variable = true;
}

I understand that this works perfectly fine in serial and will only be executed once. What atomics operation would be the correct one here in CUDA?

So that multiple threads do not try to modify the same variable simultaneously, which would have resulted in undefined behavior. — sgarizvi, Sep 23 '13 at 15:58
sgar91, yes correct it is old legacy code which I can't change in structure. So basically so first thread which gets executed should execute it, block it for the other ones in the warp and change the variable to true so that no other thread will go into that part anymore. — ThatQuantDude, Sep 23 '13 at 17:04
What you're describing doesn't really sound like an atomic function to me, but instead a critical section. You can search "cuda critical section" in the upper right hand corner for some ideas. Unfortunately one of my posts about cuda critical section got deleted. If you like I can post it as an answer here. "Atomic" functions only allow for a limited manipulation of usually a single variable in your "do some initialization" area. So if that area is at all involved, it probably can't be serviced with an atomic, although atomics help with building critical sections. — Robert Crovella, Sep 23 '13 at 18:39

Robert Crovella · Accepted Answer · 2022-09-12T13:52:09.083

It looks to me like what you want is a "critical section" in your code. A critical section allows one thread to execute a sequence of instructions while preventing any other thread or threadblock from executing those instructions.

A critical section can be used to control access to a memory area, for example, so as to allow un-conflicted access to that area by a single thread.

Atomics by themselves can only be used for a very limited, basically single operation, on a single variable. But atomics can be used to build a critical section.

You should use the following code in your kernel to control thread access to a critical section:

__syncthreads();
if (threadIdx.x == 0)
  acquire_semaphore(&sem);
__syncthreads();
  //begin critical section
  // ... your critical section code goes here
  //end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if (threadIdx.x == 0)
  release_semaphore(&sem);
__syncthreads();

Prior to the kernel define these helper functions and device variable:

__device__ volatile int sem = 0;

__device__ void acquire_semaphore(volatile int *lock){
  while (atomicCAS((int *)lock, 0, 1) != 0);
  }

__device__ void release_semaphore(volatile int *lock){
  *lock = 0;
  __threadfence();
  }

I have tested and used successfully the above code. Note that it essentially arbitrates between threadblocks using thread 0 in each threadblock as a requestor. You should further condition (e.g. if (threadIdx.x < ...)) your critical section code if you want only one thread in the winning threadblock to execute the critical section code.

Having multiple threads within a warp arbitrate for a semaphore presents additional complexities, so I don't recommend that approach. Instead, have each threadblock arbitrate as I have shown here, and then control your behavior within the winning threadblock using ordinary threadblock communication/synchronization methods (e.g. __syncthreads(), shared memory, etc.)

Note that this methodology will be costly to performance. You should only use critical sections when you cannot figure out how to otherwise parallelize your algorithm.

Finally, a word of warning. As in any threaded parallel architecture, improper use of critical sections can lead to deadlock. In particular, making assumptions about order of execution of threadblocks and/or warps within a threadblock is a flawed approach.

Here is an example of usage of binary_semaphore to implement a single device global "lock" that could be used for access control to a critical section.

I don't understand the purpose of `__syncthreads` here. Does it force the other thread to reach this point? Because to me it seems that they are useless since other threads wil reach this point in all cases, no? — Antoine Morrier, Jan 08 '19 at 10:01
One idea behind the critical section negotiation here is that you may be using multiple threads in the threadblock to do the critical section "work". In that case, they should not begin the "work" until the master thread has properly negotiated for and acquired the global lock. The `__syncthreads()` will enforce that behavior, and they will also enforce that all threads have finished the critical section "work" before the lock is released. If you don't need that sort of cooperative behavior in the threadblock, you may not need the `__syncthreads()`. — Robert Crovella, Jan 08 '19 at 14:55
It may not be necessary for correct operation. The lock release code should translate into a direct write to memory with `volatile`, but the `__threadfence()` serves a similar purpose. If you did have any other code that read the semaphore for some reason, the `volatile` would be useful. — Robert Crovella, Jan 08 '19 at 15:21
@RobertCrovella According to the CUDA docs `__threadfence()` can be used only to ensure the correct order of memory accesses. So if you do `__threadfence()` after unlock, it is possible for a thread to be in the intermediate state right between `lock=0` and `__threadfence`. If another thread takes the lock at this moment, some of the memory writes made by a former thread can be still not visible for the latter thread and appear inside the mid of critical section. If you strictly follow CUDA docs, you should put `__threadfence` before `lock = 0`, otherwise it doesn't guarantee anything useful. — , Dec 04 '19 at 21:55
@BhavinChirag There is a threadfence after the critical section, and before lock = 0. Take another look. You should not assume that `acquire_semaphore` and `release_semaphore` are the only thing required to create a proper critical section. You need to digest all the code in the answer. — Robert Crovella, Dec 04 '19 at 21:58
@RobertCrovella Ah sorry, totally agree with your point now. Actually, your implementations seem to be one of the most correct from what I found on SO. — , Dec 04 '19 at 22:04

Cuda atomics change flag

1 Answers1

Linked

Related