
I'm a newcomer to CUDA, and I'm trying to implement a mutex in a kernel function.

I read some tutorials and wrote my function, but in some cases a deadlock happens.

Here is my code. The kernel is very simple: it counts the number of threads launched by the main function.

#include <iostream>
#include <cuda_runtime.h>

__global__ void countThreads(int* sum, int* mutex) {
    while(atomicCAS(mutex, 0, 1) != 0); // lock
    
    *sum += 1;
    __threadfence();

    atomicExch(mutex, 0); // unlock
}

int main() {
    int* mutex = nullptr;
    cudaMalloc(&mutex, sizeof(int));
    cudaMemset(mutex, 0, sizeof(int));

    int* sum = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));

    int ret = 0;
    // pass, result is 1024
    countThreads<<<1024, 1>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl; 
    
    // deadlock, why?
    countThreads<<<1, 2>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;

    return 0;
}

So, can anyone tell me why the program deadlocks when calling countThreads<<<1, 2>>>(), and how to fix it? I want to implement a cross-block mutex, though maybe that is not a good idea. Many thanks.

I experimented for some time and found that if I use threads in the same block, the deadlock happens; otherwise, everything works well.

syby119
  • See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture in the paragraph starting with "Prior to Volta, ...". – havogt Nov 25 '20 at 12:56

1 Answer


Having threads in the same warp attempt to negotiate for a lock or mutex is probably the worst-case scenario. It is fairly difficult to program correctly, and the behavior may change depending on the exact GPU you are running on.

Here is an example of the type of analysis needed to explain the exact reason for the deadlock in a particular case. Such analysis is not readily done on what you have shown here because you have not indicated the type of GPU you are compiling for, or running on. It's also fairly important to provide the CUDA version you are using for compilation. I have witnessed the generated code change from one compiler version to another in ways that may impact this. Even if you provided that information, I'm not sure the analysis is really worthwhile, because I consider the negotiation-within-a-warp case to be extra troublesome to program correctly. This question/answer may also be of interest.

My general suggestion for a newcomer to CUDA (as you say you are) would be to use a method similar to what is described here. Briefly, negotiate for the lock at the threadblock level (i.e. have one thread in each block negotiate among the other blocks for the lock), then manage the singleton activity within the block using standard, available block-level coordination schemes such as __syncthreads() and conditional code.
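
Here is a minimal sketch of that pattern, built on the sum and mutex allocations from your question (the kernel name and the choice to have thread 0 add blockDim.x in a single step are my own illustration, not code from the linked answer):

__global__ void countThreadsBlockLock(int* sum, int* mutex) {
    // Only one thread per block negotiates for the device-wide lock,
    // so threads within a warp never contend with each other.
    if (threadIdx.x == 0) {
        while (atomicCAS(mutex, 0, 1) != 0); // spin until this block owns the lock
        *sum += blockDim.x;                  // singleton work: count the whole block at once
        __threadfence();                     // make the update visible to other blocks
        atomicExch(mutex, 0);                // unlock
    }
    __syncthreads(); // the rest of the block waits for thread 0 to finish
}

Launched as countThreadsBlockLock<<<1024, 1>>>(sum, mutex) or countThreadsBlockLock<<<1, 2>>>(sum, mutex), this should produce the same totals you expect, without intra-warp contention for the lock.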

You can learn more about this topic by searching the cuda tag for keywords such as "lock", "critical section", etc.

FWIW, for me, anyway, your code does deadlock on a Kepler device and does not deadlock on a Volta device, as suggested by the reference in the comments. I'm not attempting to make any statement about whether your code is defect-free; it's just an observation. If I modify your kernel to look like this:

__global__ void countThreads(int* sum, int* mutex) {
    int old = 1;
    while (old) {
        old = atomicCAS(mutex, 0, 1); // lock
        if (old == 0) {
            *sum += 1;
            __threadfence();

            atomicExch(mutex, 0); // unlock
        }
    }
}

Then it seems to me to work in either the Kepler case or the Volta case. I'm not advancing this example to suggest it is "correct", but rather to show that somewhat innocuous code modifications can change a code from a deadlock case to a non-deadlock case, or vice versa. This kind of fragility is best avoided, certainly in the pre-Volta case, in my opinion.

For the Volta-and-forward case (CUDA 11 and forward), you may want to use capabilities from the libcu++ library, such as a semaphore.
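
As a rough sketch only (assuming CUDA 11+, a Volta-class or newer GPU, and that a statically initialized __device__ cuda::binary_semaphore<cuda::thread_scope_device> from the <cuda/semaphore> header suits your case), a counting kernel built on libcu++ might look something like this:

#include <cstdio>
#include <cuda/semaphore>

// device-wide binary semaphore used as a mutex, initialized to 1 (unlocked)
__device__ cuda::binary_semaphore<cuda::thread_scope_device> sem(1);

__global__ void countThreadsSem(int* sum) {
    sem.acquire(); // lock
    *sum += 1;     // critical section
    sem.release(); // unlock
}

int main() {
    int* sum = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));

    countThreadsSem<<<2, 128>>>(sum);

    int ret = 0;
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("%d\n", ret); // expected: 256

    cudaFree(sum);
    return 0;
}

The acquire/release pair at device scope provides the ordering that the hand-rolled version needed __threadfence() for.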

Robert Crovella