I'm new to CUDA and am trying to implement a mutex inside a kernel function.
I read some tutorials and wrote my own version, but in some cases it deadlocks.
Here is my code. The kernel is very simple: it counts the number of running threads launched by the main function.
#include <iostream>
#include <cuda_runtime.h>

// Each thread takes the lock, increments the shared counter, then releases the lock.
__global__ void countThreads(int* sum, int* mutex) {
    while (atomicCAS(mutex, 0, 1) != 0); // lock
    *sum += 1;
    __threadfence();
    atomicExch(mutex, 0); // unlock
}

int main() {
    int* mutex = nullptr;
    cudaMalloc(&mutex, sizeof(int));
    cudaMemset(mutex, 0, sizeof(int)); // device pointer itself, not its host address
    int* sum = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));
    int ret = 0;

    // pass, result is 1024
    countThreads<<<1024, 1>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;

    // deadlock, why?
    countThreads<<<1, 2>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;

    return 0;
}
So, can anyone tell me why the program deadlocks when calling countThreads<<<1, 2>>>(), and how to fix it? I want to implement a cross-block mutex, though maybe that is not a good idea. Many thanks.
I experimented for a while and found that the deadlock happens when the threads are in the same block; otherwise, everything works fine.
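Since the hang only shows up when both threads share a block, one possible explanation is that the two threads run in the same warp: on GPUs without independent thread scheduling (pre-Volta), the thread that fails atomicCAS can keep the warp spinning in the loop, so the thread holding the lock never reaches the unlock. Below is a minimal sketch of a restructured kernel under that assumption, where the critical section and the unlock sit inside the same branch as the successful lock attempt. The names countThreadsNoSpin and runCount are hypothetical helpers introduced only for this illustration, and this pattern is a commonly used workaround rather than something guaranteed on every architecture or compiler version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of countThreads: acquire, update, and release all
// happen inside one branch, so a thread never spins while another thread
// of the same warp is waiting to release the lock.
__global__ void countThreadsNoSpin(int* sum, int* mutex) {
    volatile int* vsum = sum;   // avoid keeping the counter in a register
    bool done = false;
    while (!done) {
        if (atomicCAS(mutex, 0, 1) == 0) {  // one attempt to take the lock
            *vsum = *vsum + 1;              // critical section
            __threadfence();                // publish the write before unlocking
            atomicExch(mutex, 0);           // unlock
            done = true;
        }
    }
}

// Hypothetical helper: allocates and zeroes the counter and the lock,
// launches the kernel, and returns the final count.
int runCount(int blocks, int threads) {
    int* sum = nullptr;
    int* mutex = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMalloc(&mutex, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));
    cudaMemset(mutex, 0, sizeof(int));

    countThreadsNoSpin<<<blocks, threads>>>(sum, mutex);

    int ret = -1;
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(sum);
    cudaFree(mutex);
    return ret;
}

int main() {
    printf("%d\n", runCount(1024, 1));  // one thread per block
    printf("%d\n", runCount(1, 2));     // two threads in the same block
    return 0;
}
```

If the only goal is counting, replacing the whole lock with atomicAdd(sum, 1) avoids the mutex entirely and is likely the simpler choice; the sketch above is only relevant when the critical section cannot be expressed as a single atomic operation.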