
I've run into the issue that the CUDA atomic API does not have an atomicLoad function. After searching on Stack Overflow, I found the following implementation of a CUDA atomicLoad.

But it looks like this function fails to work in the following example:

#include <cassert>
#include <iostream>
#include <cuda_runtime_api.h>

template <typename T>
__device__ T atomicLoad(const T* addr) {
    const volatile T* vaddr = addr;  // To bypass cache
    __threadfence();                 // for seq_cst loads. Remove for acquire semantics.
    const T value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

__global__ void initAtomic(unsigned& count, const unsigned initValue) {
    count = initValue;
}

__global__ void addVerify(unsigned& count, const unsigned biasAtomicValue) {
    atomicAdd(&count, 1);
    // NOTE: When the following while loop is uncommented, addVerify gets stuck;
    //       it never observes the final value of count
//    while (atomicLoad(&count) != (1024 * 1024 + biasAtomicValue)) {
//        printf("count = %u\n", atomicLoad(&count));
//    }
}

int main() {
    std::cout << "Hello, CUDA atomics!" << std::endl;
    const auto atomicSize = sizeof(unsigned);

    unsigned* datomic = nullptr;
    cudaMalloc(&datomic, atomicSize);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    constexpr unsigned biasAtomicValue = 11;
    initAtomic<<<1, 1, 0, stream>>>(*datomic, biasAtomicValue);
    addVerify<<<1024, 1024, 0, stream>>>(*datomic, biasAtomicValue);
    cudaStreamSynchronize(stream);

    unsigned countHost = 0;
    cudaMemcpyAsync(&countHost, datomic, atomicSize, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait for the async copy before checking
    assert(countHost == 1024 * 1024 + biasAtomicValue);

    cudaStreamDestroy(stream);

    return 0;
}

If you uncomment the section with atomicLoad, the application gets stuck ...

Maybe I missed something? Is there a proper way to load a variable that is modified atomically?

P.S.: I know there is a cuda::atomic implementation, but that API is not supported by my hardware.

– Denis Kotov

1 Answer


Since warps execute in lockstep (at least on older architectures), if you put a conditional wait on one thread and the producer on another thread in the same warp, the warp can get stuck in the wait if the waiting branch is executed first. Only the newer architectures with independent thread scheduling can make progress here, so you should query the major/minor compute capability before running this; Volta and onwards is OK (see the sketch below).
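A minimal sketch of that capability check (assuming device 0 is the device the kernels run on; Volta is compute capability 7.0, so major >= 7 implies independent thread scheduling):

#include <cuda_runtime_api.h>

// Returns true if the device supports independent thread scheduling.
bool hasIndependentThreadScheduling(int device = 0) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    return prop.major >= 7;  // Volta (7.x) and newer
}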

Also, you are launching 1 million threads and waiting on all of them at once. The GPU may not have enough execution ports/pipeline availability to keep 1 million threads in flight; it might only work on a GPU with 64k CUDA pipelines (assuming 16 threads in flight per pipeline). Instead of waiting on millions of threads, spawn sub-kernels from the main kernel when the condition occurs; dynamic parallelism is the key feature here. You should also check the minimum major/minor compute capability required for dynamic parallelism, in case someone is using an ancient NVIDIA card.

The atomic-add instruction returns the old value at the target address. If you meant to call a third kernel only once, after the condition is met, you can simply check that returned value with an "if" before starting the dynamic-parallelism launch, as in the sketch below.
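A hedged sketch of that idea (not the asker's code): childKernel and the target value are hypothetical placeholders, and device-side launches require compute capability 3.5+ and compiling with -rdc=true.

#include <cstdio>

__global__ void childKernel(unsigned* count) {
    printf("condition reached, count = %u\n", *count);
}

__global__ void addAndMaybeLaunch(unsigned* count, unsigned target) {
    const unsigned old = atomicAdd(count, 1u);  // returns the value before the add
    if (old + 1 == target) {                    // exactly one thread sees this
        childKernel<<<1, 1>>>(count);           // dynamic-parallelism launch
    }
}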

You are printing 1 million times; that is not good for performance, and it may take some time before the text appears in the console output if you have a slow CPU/RAM.

Lastly, you can optimize the performance of the atomic operations by running them on shared memory first, then issuing the global atomic only once per block, as sketched below. This misses the point of the condition if there are more threads than the condition value (assuming an increment of 1 each time), so it may not be applicable to all algorithms.
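A minimal sketch of the per-block pattern, assuming each thread contributes an increment of 1:

__global__ void addPerBlock(unsigned* count) {
    __shared__ unsigned blockSum;
    if (threadIdx.x == 0) blockSum = 0;
    __syncthreads();

    atomicAdd(&blockSum, 1u);        // cheap shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(count, blockSum);  // one global atomic per block
}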

– huseyin tugrul buyukisik
  • Hey, thanks, it looks like you are right ... when I decreased the number of threads, the condition started to be satisfied. It seems there are not enough threads on the CUDA GPU, which is why it gets stuck on this condition. Could you suggest resources on the internal scheduler and the asynchronous scheduler? Thanks!! – Denis Kotov Feb 05 '22 at 13:48
  • 1
    @DenisKotov [this](https://developer.nvidia.com/blog/inside-volta/) blog post about the V100 talks about it under "Independent Thread Scheduling". – paleonix Feb 05 '22 at 15:14
  • You probably should check at runtime that all intended threads are active at the beginning of your code. Afterwards there are (currently) only a few situations where they could become inactive before finishing (debugger, dynamic parallelism). – Sebastian Feb 05 '22 at 16:18
  • Also, the reason I wanted to check the number of threads was that I wanted to create a custom cuda::barrier, since the cuda::barrier from the CUDA library is not supported. – Denis Kotov Feb 05 '22 at 16:24
  • 2
    FWIW - kernels which synchronize across blocks like this one (where IIUC each thread waits for all threads in all blocks in the grid to arrive) should generally use cudaLaunchCooperativeKernel, which will check if all threads in the grid can be running simultaneously (and therefore can communicate & synchronize with each other) https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g504b94170f83285c71031be6d5d15f73 – andars Feb 10 '22 at 03:13