
I just found out about the libcu++ library and am trying to use the cuda::atomic variables. I wrote the following program but it is giving me unexpected results:

#include <atomic>
#include <cuda/atomic>
#include <stdio.h>


#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}


__global__ void atomic_test()
{
    cuda::atomic<int, cuda::thread_scope_block> x{0};
    x.fetch_add(1, cuda::memory_order_seq_cst);
    __syncthreads();
    int y = x.load(cuda::memory_order_acquire);
    printf("(%d %d) - Value of x is %d\n", blockIdx.x, threadIdx.x, y); 
}

int main()
{
    atomic_test<<<2, 32>>>();
    gpuErrchk( cudaDeviceSynchronize() );
    return 0;
}

Since there is a __syncthreads() after the x.fetch_add(), I expect that every thread in the block reads the same value from x. But when I run this program, every thread except thread 31 prints 0 and thread 31 prints 32. Can someone please explain what I am doing wrong and why I am seeing this output?

I tried going through the libcu++ website, but I couldn't find any complete examples except trie.cu and concurrent_hash_table.cu. These examples unfortunately seem a bit too complicated for me.

I also found a Stack Overflow solution using semaphores, but that one works as I expect it to.

wohlstad
silversilva
  • No idea how this would explain the results, but my understanding is that you still need to put the atomic into global or shared memory. What happens if you add `__shared__` in front of it? – paleonix Mar 05 '23 at 08:53
  • If i use `__shared__` before cuda::atomic i get the error `initializer not allowed for __shared__ variable` – silversilva Mar 05 '23 at 08:59
  • [*The initialization is not atomic.*](https://en.cppreference.com/w/cpp/atomic/atomic/atomic) – paleonix Mar 05 '23 at 09:14
  • @paleonix: non-atomic initialization is presumably true for CUDA atomics as well, but you linked cppreference for ISO C++ `std::atomic`. Are you intentionally implying that `cuda::atomic` works the same as C++ `std::atomic`? In standard C++, each invocation of `atomic_test` would have its own local variables, since `x` isn't declared `static` or anything. – Peter Cordes Mar 05 '23 at 09:23
  • cuda::atomic must reside in memory which is accessible by all threads in the atomic scope. In shared memory, just initialize it like any other shared variable: `if (threadIdx.x == 0) ...;` – Abator Abetor Mar 05 '23 at 09:44
  • I think `atomic_ref` is much easier to use here. – paleonix Mar 05 '23 at 09:47
  • @PeterCordes It adds more features, but it is based on the standard semantics. Their docs link to cppreference as well. – paleonix Mar 05 '23 at 09:49
  • @paleonix Using atomic_ref with a shared variable indeed gives me the expected result. But i would still like to know how to use atomic correctly. – silversilva Mar 05 '23 at 10:31
  • `fetch_add` returns a value. Print that instead – talonmies Mar 05 '23 at 12:42
  • @talonmies: That's what I was going to say, but apparently they *want* to also check the interaction with `__syncthreads()`, checking that all threads read the same final value. But yeah, in real code, if you aren't using those 0..31 return values, you shouldn't do 32 separate increments unless they're conditional and you're counting something. (And then local counters that you reduce at the end would be better, right.) – Peter Cordes Mar 05 '23 at 22:43

1 Answer


In your code, you are declaring the atomic in local memory, so each thread gets its own private copy: it cannot be accessed by multiple threads at the same time. Therefore, there is no value in declaring a local variable as an atomic.

Why am I seeing this output?

As a general optimization for atomics, the compiler replaces the atomic increment of 1 on each thread with an atomic increment of 32 (number of threads in a warp) on the leading thread. This specific instance is most likely a compiler bug given that this optimization is not valid for local memory.
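For illustration, here is roughly what warp-aggregated atomics does, sketched by hand with CUDA warp intrinsics. This is a sketch of the general pattern, not the code the compiler actually emits; the function name `warp_aggregated_inc` and the leader-election choice are this sketch's own assumptions:

```cuda
// One atomic per warp instead of one per thread. The compiler performs a
// similar transformation automatically for eligible atomics.
__device__ int warp_aggregated_inc(int* counter)
{
    unsigned mask   = __activemask();       // lanes currently executing this call
    int      lane   = threadIdx.x & 31;     // this thread's lane within its warp
    int      leader = __ffs(mask) - 1;      // lowest active lane is elected leader
    int      prev   = 0;

    if (lane == leader)
        prev = atomicAdd(counter, __popc(mask)); // one atomic covers the whole warp

    prev = __shfl_sync(mask, prev, leader); // broadcast the old counter value
    // Each lane reconstructs the value it would have seen from its own atomicAdd:
    return prev + __popc(mask & ((1u << lane) - 1));
}
```

In your run, the single aggregated increment landed on one thread, which is why one thread observed 32 while the others still saw 0.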

The correct way to achieve this is to store your atomic in shared memory. Note that `__shared__` variables cannot have initializers, so elect one thread in your block to initialize the variable, then synchronize before using it. The example below is valid:

__global__ void atomic_test()
{
    // The atomic lives in shared memory, visible to every thread in the block.
    __shared__ cuda::atomic<int, cuda::thread_scope_block> x;
    if (threadIdx.x == 0)
    {
        x = 0; // one elected thread initializes it
    }
    __syncthreads(); // make the initialization visible before any fetch_add

    x.fetch_add(1, cuda::memory_order_seq_cst);
    __syncthreads();
    int y = x.load(cuda::memory_order_acquire);
    printf("(%d %d) - Value of x is %d\n", blockIdx.x, threadIdx.x, y);
}

and produces the expected result:

(1 0) - Value of x is 32
(1 1) - Value of x is 32
...
(0 30) - Value of x is 32
(0 31) - Value of x is 32
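As suggested in the comments, `cuda::atomic_ref` (available in newer libcu++ releases) avoids the initialization wrinkle entirely: the underlying object stays a plain `__shared__ int` that you initialize like any other shared variable. A minimal sketch along the same lines as the kernel above:

```cuda
#include <cuda/atomic>
#include <stdio.h>

__global__ void atomic_ref_test()
{
    __shared__ int x;              // plain shared variable
    if (threadIdx.x == 0)
        x = 0;                     // ordinary initialization
    __syncthreads();

    // Bind an atomic view onto the shared variable; all atomic operations
    // go through the reference, the object itself remains a plain int.
    cuda::atomic_ref<int, cuda::thread_scope_block> ref(x);
    ref.fetch_add(1, cuda::memory_order_seq_cst);
    __syncthreads();

    printf("(%d %d) - Value of x is %d\n",
           blockIdx.x, threadIdx.x, ref.load(cuda::memory_order_acquire));
}
```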
Anis Ladram