Questions tagged [gpu-atomics]

Modern GPUs support atomic operations in different memory spaces. These differ from atomic operations on CPUs both in implementation and in their consequences for execution flow.

On modern GPUs, atomic operations in global device memory may require synchronization among thousands of logical threads (or hundreds of warps/wavefronts). A GPU may also support atomic operations on an individual processing core's memory (shared memory in CUDA parlance, local memory in OpenCL parlance), which behave differently from global memory atomics in both performance and execution flow.
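As a rough illustration of the two memory spaces, here is a minimal CUDA sketch of a histogram computed once with global-memory atomics only and once with per-block shared-memory atomics that are merged into global memory at the end. The names `kBins`, `histogram_global`, `histogram_shared` and `run_histogram` are hypothetical, error checking is omitted, and the launch configuration is an arbitrary assumption rather than a tuned choice.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBins = 64;  // hypothetical bin count, chosen for illustration

// Every thread updates the histogram in global memory directly.
__global__ void histogram_global(const unsigned char* data, int n, unsigned int* bins) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i] % kBins], 1u);
}

// Each block accumulates into its own shared-memory copy, then merges once.
__global__ void histogram_shared(const unsigned char* data, int n, unsigned int* bins) {
    __shared__ unsigned int local[kBins];
    for (int b = threadIdx.x; b < kBins; b += blockDim.x) local[b] = 0;
    __syncthreads();
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i] % kBins], 1u);   // shared-memory atomic
    __syncthreads();
    for (int b = threadIdx.x; b < kBins; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);            // one global atomic per bin per block
}

// Host helper (no error checking, for brevity): runs one kernel, returns the bins.
std::vector<unsigned int> run_histogram(const std::vector<unsigned char>& input, bool use_shared) {
    unsigned char* d_data = nullptr;
    unsigned int* d_bins = nullptr;
    int n = static_cast<int>(input.size());
    cudaMalloc(&d_data, input.size());
    cudaMalloc(&d_bins, kBins * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), input.size(), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, kBins * sizeof(unsigned int));
    if (use_shared) histogram_shared<<<64, 256>>>(d_data, n, d_bins);
    else            histogram_global<<<64, 256>>>(d_data, n, d_bins);
    std::vector<unsigned int> bins(kBins);
    cudaMemcpy(bins.data(), d_bins, kBins * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    return bins;
}

int main() {
    std::vector<unsigned char> input(1 << 20);
    for (size_t i = 0; i < input.size(); ++i)
        input[i] = static_cast<unsigned char>(i % 7);  // skewed input: only 7 bins are hit
    bool same = run_histogram(input, false) == run_histogram(input, true);
    printf("global and shared variants %s\n", same ? "agree" : "differ");
    return 0;
}
```

Both variants should produce the same counts; which one is faster depends on the hardware and on how contended the bins are, so it is worth measuring rather than assuming.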

Reading on GPU atomics:

34 questions
30
votes
2 answers

What are all the atomic operations in CUDA?

I was wondering if there is a complete list of atomic operations usable in CUDA kernels. I couldn't find anything like that on the internet.
soroosh.strife
  • 1,181
  • 4
  • 19
  • 45
9
votes
1 answer

Atomic Operations in CUDA? Which header file to include?

To use atomic operations in CUDA, is it necessary to include some CUDA header file? The CUDA programming guide seems to be tight-lipped on this. The code glmax.cu given below is giving me the following compilation error. gaurish108 MyPractice:…
smilingbuddha
  • 14,334
  • 33
  • 112
  • 189
9
votes
2 answers

How can I implement a custom atomic function involving several variables?

I'd like to implement this atomic function in CUDA: __device__ float lowest; // global var __device__ int lowIdx; // global var float realNum; // thread reg var int index; // thread reg var if(realNum < lowest) { lowest= realNum; //…
Doug
  • 2,783
  • 6
  • 33
  • 37
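One common way to update two related variables atomically is to pack them into a single 64-bit word and retry with `atomicCAS`. The sketch below is an assumption about how that could look for the `lowest`/`lowIdx` pair in the question, not the method from its answers. The names `pack`, `unpack_value`, `atomic_min_with_index`, `find_lowest` and `lowest_index` are hypothetical, and error checking is omitted.

```cuda
#include <cfloat>
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Pack the float's bit pattern into the high 32 bits and the index into the low 32 bits.
__device__ unsigned long long pack(float value, int index) {
    return (static_cast<unsigned long long>(__float_as_uint(value)) << 32) |
           static_cast<unsigned int>(index);
}

__device__ float unpack_value(unsigned long long packed) {
    return __uint_as_float(static_cast<unsigned int>(packed >> 32));
}

// Replace the stored (value, index) pair only if `value` is lower; retry on contention.
__device__ void atomic_min_with_index(unsigned long long* addr, float value, int index) {
    unsigned long long observed = *addr;
    while (value < unpack_value(observed)) {
        unsigned long long assumed = observed;
        observed = atomicCAS(addr, assumed, pack(value, index));
        if (observed == assumed) break;  // our pair was written
    }
}

__global__ void find_lowest(const float* values, int n, unsigned long long* lowest) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomic_min_with_index(lowest, values[i], i);
}

// Host helper: returns the index of the lowest value and writes the value to *lowest_out.
int lowest_index(const std::vector<float>& values, float* lowest_out) {
    int n = static_cast<int>(values.size());
    float init_value = FLT_MAX;
    unsigned int bits = 0;
    std::memcpy(&bits, &init_value, sizeof(bits));
    // Initial pair: (FLT_MAX, -1), so any finite input below FLT_MAX replaces it.
    unsigned long long packed = (static_cast<unsigned long long>(bits) << 32) | 0xFFFFFFFFull;
    float* d_values = nullptr;
    unsigned long long* d_lowest = nullptr;
    cudaMalloc(&d_values, n * sizeof(float));
    cudaMalloc(&d_lowest, sizeof(packed));
    cudaMemcpy(d_values, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lowest, &packed, sizeof(packed), cudaMemcpyHostToDevice);
    find_lowest<<<(n + 255) / 256, 256>>>(d_values, n, d_lowest);
    cudaMemcpy(&packed, d_lowest, sizeof(packed), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_lowest);
    bits = static_cast<unsigned int>(packed >> 32);
    std::memcpy(lowest_out, &bits, sizeof(bits));
    return static_cast<int>(static_cast<unsigned int>(packed & 0xFFFFFFFFull));
}

int main() {
    std::vector<float> values(4096);
    for (int i = 0; i < 4096; ++i) values[i] = static_cast<float>((i * 37) % 1000) + 1.0f;
    values[1234] = 0.5f;  // unique minimum
    float lowest = 0.0f;
    int index = lowest_index(values, &lowest);
    printf("lowest = %f at index %d\n", lowest, index);
    return 0;
}
```

When several elements share the lowest value, this sketch keeps whichever thread won the race, so the reported index is not deterministic in that case.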
7
votes
1 answer

How to use atomic operations on an SSBO in a compute shader

Example code Here is a bare-bones compute shader to illustrate my question layout(local_size_x = 64) in; // Persistent LIFO structure with a count of elements layout(std430, binding = 0) restrict buffer SMyBuffer { int count; float…
bernie
  • 9,820
  • 5
  • 62
  • 92
5
votes
1 answer

Question about modifying a flag array in CUDA

I am doing research on GPU programming and have a question about modifying a global array from within a thread. __device__ float data[10] = {0,0,0,0,0,0,0,0,0,1}; __global__ void gradually_set_global_data() { while (1) { if (data[threadIdx.x +…
hustwjq
  • 51
  • 2
5
votes
1 answer

CUDA atomic operations and concurrent kernel launch

I am currently developing a GPU-based program that uses multiple kernels launched concurrently on multiple streams. In my application, multiple kernels need to access a queue/stack, and I plan to use atomic operations. But I do not know…
4
votes
1 answer

Speeding up CUDA atomics calculation for many bins/few bins

I am trying to optimize my histogram calculations in CUDA. It gives me an excellent speedup over corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's…
kakrafoon
  • 476
  • 5
  • 13
4
votes
2 answers

How to have atomic load in CUDA

My question is how to perform an atomic load in CUDA. Atomic exchange can emulate an atomic store. Can an atomic load be emulated cheaply in a similar manner? I can use an atomic add with 0 to load the content atomically, but I think it is expensive…
kirill
  • 41
  • 3
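The two emulations mentioned in the question above (atomic exchange as a store, atomic add of 0 as a load) can be sketched as follows. The wrapper names `atomic_store`, `atomic_load` and `store_then_load` are hypothetical, and the sketch makes no claim about cost: it only shows the form of the calls. Error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Store emulated with an atomic exchange; the previous value is discarded.
__device__ void atomic_store(int* addr, int value) {
    atomicExch(addr, value);
}

// Load emulated with an atomic add of zero; the memory contents are unchanged.
__device__ int atomic_load(int* addr) {
    return atomicAdd(addr, 0);
}

__global__ void writer(int* cell, int value) {
    atomic_store(cell, value);
}

__global__ void reader(int* cell, int* out) {
    *out = atomic_load(cell);
}

// Host helper: writes `value` in one kernel and reads it back in a second one.
int store_then_load(int value) {
    int* d_cell = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_cell, sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_cell, 0, sizeof(int));
    writer<<<1, 1>>>(d_cell, value);
    reader<<<1, 1>>>(d_cell, d_out);  // same stream, so it runs after writer
    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_cell);
    cudaFree(d_out);
    return result;
}

int main() {
    printf("loaded %d\n", store_then_load(42));
    return 0;
}
```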
4
votes
3 answers

error : identifier "atomicAdd" is undefined under visual studio 2010 & cuda 4.2 with Fermi GPU

I was trying to compile some CUDA code under Visual Studio 2010 with CUDA 4.2 (I created this CUDA project using Parallel Nsight 2.2), but I encountered the error 'identifier "atomicAdd" is undefined', which I still can't solve…
G_fans
  • 183
  • 1
  • 3
  • 13
3
votes
7 answers

CUDA: reduction or atomic operations?

I'm writing a CUDA kernel which involves calculating the maximum value on a given matrix and I'm evaluating possibilities. The best way I could find is: Forcing every thread to store a value in the shared memory and using a reduction algorithm after…
Marco A.
  • 43,032
  • 26
  • 132
  • 246
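The two options weighed in the question above are not mutually exclusive. A minimal sketch, assuming integer matrix elements (CUDA's built-in `atomicMax` has no floating-point overload), reduces within each block in shared memory and then issues a single `atomicMax` per block. The names `kThreads`, `max_kernel` and `matrix_max` are hypothetical, and error checking is omitted.

```cuda
#include <climits>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // block size; must be a power of two for this reduction

__global__ void max_kernel(const int* data, int n, int* result) {
    __shared__ int smem[kThreads];
    // Each thread first folds a strided slice of the input into a private maximum.
    int local = INT_MIN;
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local = max(local, data[i]);
    smem[threadIdx.x] = local;
    __syncthreads();
    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] = max(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }
    // One global atomic per block instead of one per element.
    if (threadIdx.x == 0) atomicMax(result, smem[0]);
}

// Host helper: returns the maximum element of a non-empty matrix stored as a flat array.
int matrix_max(const std::vector<int>& matrix) {
    int n = static_cast<int>(matrix.size());
    int result = INT_MIN;
    int* d_data = nullptr;
    int* d_result = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_data, matrix.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &result, sizeof(int), cudaMemcpyHostToDevice);
    max_kernel<<<32, kThreads>>>(d_data, n, d_result);
    cudaMemcpy(&result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return result;
}

int main() {
    std::vector<int> matrix(512 * 512);
    for (size_t i = 0; i < matrix.size(); ++i) matrix[i] = static_cast<int>(i % 1000);
    matrix[100000] = 5000;  // the largest element
    printf("max = %d\n", matrix_max(matrix));
    return 0;
}
```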
3
votes
3 answers

Which is faster for CUDA shared-mem atomics - warp locality or anti-locality?

Suppose many warps in a (CUDA kernel grid) block are repeatedly updating a fair-sized number of shared memory locations. In which case will such work be completed faster? The case of intra-warp access locality, e.g. the total number of…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
3
votes
1 answer

OpenGL atomic counters vs atomics in a SSBO

I came across this article that states there are no differences in performance between atomic counter buffers and an atomic variable in an…
iam
  • 1,623
  • 1
  • 14
  • 28
2
votes
1 answer

Is a combination of atomic CAS for 64 and 32 bit ok?

My global array contains struct {float,float}. The first thing I do to it is a 64-bit CAS on one of the structs. Depending on the return value, I may want to modify the second float. Now I have the option of using either a 32-bit CAS or a 64-bit one. I…
John
  • 145
  • 1
  • 9
2
votes
2 answers

Atomic addition to floating point values in OpenCL for NVIDIA GPUs?

The OpenCL 3.0 specification does not seem to have intrinsics/builtins for atomic addition to floating-point values, only for integral values (and that seems to have been the case in OpenCL 1.x and 2.x as well). CUDA, however, has offered…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
2
votes
1 answer

Is there a proper CUDA atomicLoad function?

I've run into the issue that the CUDA atomic API does not have an atomicLoad function. After searching on Stack Overflow, I found the following implementation of CUDA atomicLoad. But it looks like this function fails to work in the following…
Denis Kotov
  • 857
  • 2
  • 10
  • 29