
I am trying to implement a GPU application which requires the use of a mutex. I know this isn't ideal, but for correctness it is required. When the mutex is acquired (which isn't often), all other threads halt; only the single holding thread continues until it finishes, at which point all threads resume normal operation.
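Roughly, the pattern I am aiming for looks like this (a simplified outline with placeholder names, not my actual code):

__device__ int lockFlag = 0;   // 0 = free, 1 = held by exactly one thread

__device__ void exclusiveSection() {
    // Acquire: spin until this thread is the one that flips the flag from 0 to 1
    while (atomicCAS(&lockFlag, 0, 1) != 0) {
        // all other threads wait here
    }
    __threadfence();
    // ... the single holding thread does its exclusive work ...
    __threadfence();
    atomicExch(&lockFlag, 0);   // Release: everyone else may continue
}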

I have tried to implement this using atomic operations to modify the flags and busy waiting in the waiting threads. However, at some point the execution just stops. I first thought there was simply a deadlock somewhere, but that doesn't seem to be the case: the execution gets stuck at a seemingly arbitrary print statement.

Therefore, I was wondering: is there some guarantee that all threads will eventually be scheduled, or is it possible that the busy-waiting loop is hogging all the scheduling cycles of the GPU?

This is the busy-waiting loop:

while (flag) {
    if (count > 10000) {
        count = 0;   //Only used as breakpoint to see when the cycle has been entered
    }
    if (failFlag) {
        return false;
    }
    count++;
}
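For context, flag and failFlag are shared by all threads of the kernel, while count is just a per-thread debug counter; a simplified sketch of the declarations (not my exact code):

__device__ int flag = 0;          // nonzero while one thread holds the mutex
__device__ bool failFlag = false; // set when the exclusive work fails
// count is a local int in the calling function, only used for debugging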

This is how the flags are set:

bool setFlag(int* loc, int val, bool strict=true) {
    int val_i = val == 0 ? 1 : 0;

    //On the device, atomically compare-and-swap the flag
    uint64_cu res = atomicCAS(loc, val_i, val);
    //In strict mode, fail if the flag did not hold the expected old value
    if ( (res != val_i) && strict) {
        return false;
    }
    __threadfence();
    return true;
}
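And this is roughly how setFlag is used to acquire and release the mutex (simplified; lock is a placeholder name for the actual flag location):

__device__ int lock = 0;

__device__ void doExclusiveWork() {
    // Acquire: strict mode returns false when another thread changed the flag first
    while (!setFlag(&lock, 1)) {
        // mutex is held elsewhere, keep trying
    }
    // ... exclusive work ...
    setFlag(&lock, 0, false);   // Release (non-strict, always succeeds)
}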

And this is the seemingly arbitrary line that the execution of the second thread never seems to move past:

printf("%i:\t\t\t\tRebuild\n", getThreadID());

where getThreadID() returns threadIdx.x.
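For completeness, it is just a small helper along these lines:

__device__ int getThreadID() {
    return threadIdx.x;   // thread index within its block
}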

I first tried memcheck to see if any memory issue was showing up, which gave no errors. Then I tried racecheck, which also didn't show any issues. I then used some print statements to see roughly where the execution was hanging. Finally, I used the debugger, which showed that the first thread was moving through the busy-waiting loop, while the other thread was seemingly stuck on a random print statement I was using for debugging (even though there were several similar statements before that point).
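For reference, these are roughly the invocations I used for those checks (./main stands in for my actual binary):

compute-sanitizer --tool memcheck ./main     # no errors reported
compute-sanitizer --tool racecheck ./main    # no hazards reported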

Here is the debugger output; lines 377 to 385 are the busy-waiting loop, while line 206 is just a print statement:

Thread 1 "main" hit Breakpoint 1, MyProgram::insert (this=0x7fffba000000, k=152796131036661202) at /home/User/MyProgramParallel/src/DataStructure.cu:379
379     in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda thread
Unrecognized option: 'thread'.
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC                                            Filename  Line
Kernel 0
   (0,0,0)   (0,0,0)     (0,0,0)   (0,0,0)     1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu   206
*  (0,0,0)   (1,0,0)     (0,0,0)   (1,0,0)     1 0x0000555558f20c70 /home/User/MyProgramParallel/src/DataStructure.cu   379
(cuda-gdb) step
381     in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC                                            Filename  Line
Kernel 0
   (0,0,0)   (0,0,0)     (0,0,0)   (0,0,0)     1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu   206
*  (0,0,0)   (1,0,0)     (0,0,0)   (1,0,0)     1 0x0000555558f20ce0 /home/User/MyProgramParallel/src/DataStructure.cu   381
(cuda-gdb) step
384     in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx ThreadIdx Count         Virtual PC                                            Filename  Line
Kernel 0
   (0,0,0)   (0,0,0)     (0,0,0)   (0,0,0)     1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu   206
*  (0,0,0)   (1,0,0)     (0,0,0)   (1,0,0)     1 0x0000555558f20ea0 /home/User/MyProgramParallel/src/DataStructure.cu   384

I would expect both threads to make progress, with the first moving past line 206 and the other moving through the busy-waiting loop. However, this is not the case, no matter how many times I continue the execution past the breakpoint. That is why I'm wondering whether there is a liveness guarantee in CUDA. Or is this what a thread looks like after it has crashed? And otherwise, what is another possible reason for this behaviour? Before this point, the two threads seemed to be working in lockstep.

The CUDA version is 11.3, and the operating system is Ubuntu.

Daan W
    There is no "liveness" guarantee. You must, among other things, make sure that all threads in the kernel are physically resident. Otherwise you will get the hang you mentioned. A CUDA-provided method to handle these types of situations, rather than rolling your own, is to use cooperative groups / cooperative kernel launch. – Robert Crovella Oct 26 '22 at 14:54
