I am trying to implement a GPU application that requires a mutex. I know this isn't ideal, but it is required for correctness. When the mutex is acquired, which isn't often, all other threads must halt; only the single owning thread continues until it finishes, at which point all threads may resume normal operation.
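Conceptually, the pattern I am implementing looks like this (a minimal sketch only, with illustrative names; my actual code is structured differently):

__device__ int mutex = 0;  // 0 = free, 1 = held

__device__ void criticalSection() {
    // Spin until we flip mutex from 0 to 1; only one thread wins
    while (atomicCAS(&mutex, 0, 1) != 0) { }
    // ... single-threaded work while every other thread waits ...
    __threadfence();        // make the writes visible before releasing
    atomicExch(&mutex, 0);  // release the mutex
}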
I have tried to implement this using atomic operations to modify the flags, with busy waiting in the waiting threads. However, at some point execution simply stops. I thought there was a deadlock somewhere, but this doesn't seem to be the case: execution appears to get stuck on a seemingly arbitrary print statement.
Therefore, I am wondering: is there some guarantee that all threads will eventually be scheduled, or is it possible that the busy-waiting loop is hogging all of the GPU's scheduling cycles?
This is the busy-waiting loop:
while (flag) {
    if (count > 10000) {
        count = 0;  // only used as a breakpoint target to see when the loop has been entered
    }
    if (failFlag) {
        return false;
    }
    count++;
}
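For context, the variables involved look roughly like this (simplified; in the real code flag and failFlag live in global device memory, and count is a thread-local counter):

__device__ int flag;      // non-zero while another thread holds the mutex
__device__ int failFlag;  // set when the owning thread has to abort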
This is how the flags are set:
bool setFlag(int* loc, int val, bool strict = true) {
    int val_i = val == 0 ? 1 : 0;
    // On the device, atomically compare-and-swap the flag
    int res = atomicCAS(loc, val_i, val);
    // Make sure the value hasn't changed in the meantime
    if ((res != val_i) && strict) {
        return false;
    }
    __threadfence();
    return true;
}
and this is the seemingly arbitrary line that the second thread's execution never seems to move past:
printf("%i:\t\t\t\tRebuild\n", getThreadID());
where getThreadID() simply returns threadIdx.x.
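In full it is trivial (in this test I launch a single 1-D block, so threadIdx.x is unique per thread):

__device__ int getThreadID() {
    return threadIdx.x;
}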
I first tried memcheck to see whether some memory issue was coming up, which reported no errors. Then I tried racecheck, which also showed no issues. I then used some print statements to see roughly where execution was hanging. Finally, I used the debugger, which showed the first thread moving through the busy-waiting loop while the other thread was seemingly stuck on a random print statement I was using for debugging (even though there were several similar statements before that point).
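From memory, the invocations were roughly:

cuda-memcheck ./main                     # memory checker: no errors
cuda-memcheck --tool racecheck ./main    # race detector: no hazards reported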
Here is the debugger output; lines 377 to 385 are the busy-wait loop, while line 206 is just a print statement:
Thread 1 "main" hit Breakpoint 1, MyProgram::insert (this=0x7fffba000000, k=152796131036661202) at /home/User/MyProgramParallel/src/DataStructure.cu:379
379 in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda threads
BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC Filename Line
Kernel 0
(0,0,0) (0,0,0) (0,0,0) (0,0,0) 1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu 206
* (0,0,0) (1,0,0) (0,0,0) (1,0,0) 1 0x0000555558f20c70 /home/User/MyProgramParallel/src/DataStructure.cu 379
(cuda-gdb) step
381 in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda threads
BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC Filename Line
Kernel 0
(0,0,0) (0,0,0) (0,0,0) (0,0,0) 1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu 206
* (0,0,0) (1,0,0) (0,0,0) (1,0,0) 1 0x0000555558f20ce0 /home/User/MyProgramParallel/src/DataStructure.cu 381
(cuda-gdb) step
384 in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda threads
BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC Filename Line
Kernel 0
(0,0,0) (0,0,0) (0,0,0) (0,0,0) 1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu 206
* (0,0,0) (1,0,0) (0,0,0) (1,0,0) 1 0x0000555558f20ea0 /home/User/MyProgramParallel/src/DataStructure.cu 384
I would expect both threads to make progress, with the first moving past line 206 and the other moving through the busy-waiting loop. However, this is not the case, no matter how many times I continue execution from the breakpoint. That is why I'm wondering: is there a liveness guarantee in CUDA? Or is this what a thread looks like after it has crashed? Otherwise, what is another possible reason for this behaviour? Before this point, the two threads seemed to be working in lockstep.
The CUDA version is 11.3, and the operating system is Ubuntu.