
I have an array matrix whose values are all 0, and I want to increment some of its elements by 1. The indices of matrix that I want to increment are stored in the array indices. Some elements need to be incremented several times, so I'm trying to use an array of mutexes, one for each element of matrix. But when I launch my code, the program hangs with a deadlock.

I'm stuck on this issue. What I ultimately want to do is draw a continuous brush stroke that overlaps itself using CUDA, so I need to access the same pixels of the canvas image in parallel.

Here is my code:

#include <iostream>
using namespace std;

__global__ void add_kernel(int* matrix, int* indices, int* d_semaphores, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // global thread id

    if (index < nof_indices) {
        int ind = indices[index]; // index of the element of array A to increment
        while (atomicCAS(&d_semaphores[ind], 0, 1) != 0); // spin until this element's mutex is acquired
        matrix[ind] += 1;
        atomicExch(&d_semaphores[ind], 0); // release the mutex
        __syncthreads();
    }
}

int main()
{
    int nof_indices = 6; // length of array B
    int indices[6] = { 0,1,2,3,4,1 }; // array B; stores the indices of array A to increment
    int canvas[10]; // array A
    int semaphores[10]; // mutex array with an individual mutex for each element of array A

    int* d_canvas;
    int* d_indices;
    int* d_semaphores;

    memset(canvas, 0, sizeof(canvas)); // set all array A elements to 0
    memset(semaphores, 0, sizeof(semaphores)); // set all mutexes to 0 (unlocked)

    cudaMalloc(&d_canvas, sizeof(canvas));
    cudaMalloc(&d_semaphores, sizeof(semaphores));
    cudaMalloc(&d_indices, sizeof(indices));

    cudaMemcpy(d_canvas, canvas, sizeof(canvas), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, indices, sizeof(indices), cudaMemcpyHostToDevice);
    cudaMemcpy(d_semaphores, semaphores, sizeof(semaphores), cudaMemcpyHostToDevice);

    add_kernel<<<1, 6>>>(d_canvas, d_indices, d_semaphores, nof_indices);

    cudaMemcpy(canvas, d_canvas, sizeof(canvas), cudaMemcpyDeviceToHost);

    for (int it = 0; it < nof_indices; it++) {
        cout << canvas[it] << endl;
    }

    cudaFree(d_canvas);
    cudaFree(d_indices);
    cudaFree(d_semaphores);

    return 0;
}

In this example the resulting array matrix should look like this: {1, 2, 1, 1, 1, 0}, but I only get that result when I run the kernel with dimensions <<<6, 1>>>.

I'm using CUDA 12.1 on a GeForce RTX 3060.

Thank you

(It only works when I set the threads-per-block count to 1, but that's not what I want.)

  • Note that using CAS and mutexes on a GPU is very inefficient. GPUs are not designed to run that efficiently; they are designed to execute *cooperative* parallel algorithms, not ones with mutual *exclusion* (or serial atomic operations). This is because of the large amount of parallelism combined with the higher latency compared to CPUs. I strongly advise revising your algorithm so it does not need this. If you cannot, then consider doing this on the CPU. In fact, doing the operation serially on the CPU might even be faster (because of the high frequency and ILP). – Jérôme Richard May 13 '23 at 15:50
  • Why are you not just doing atomic increments of the corresponding pixels instead of using mutexes? For reference, this parallel pattern is called a scatter (with conflict handling). (A minimal sketch of that approach follows these comments.) – paleonix May 13 '23 at 15:58
  • @JérômeRichard I see, I will try a version on CPU – sergei May 13 '23 at 16:52
  • @paleonix I need the mutex because I would also need to sort the order in which to increment (eventually it won't be just 1 that I'm adding). Think of drawing opaque brush strokes: depending on the index along the brush stroke, the addition to the canvas could be on top of the current pixel value (where atomicAdd would work, you are right) or "under" the current pixel value (not adding anything to the current pixel value). – sergei May 13 '23 at 16:57
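For the simple case shown in the question, where every referenced pixel just gets incremented by 1, the scatter-with-atomics approach suggested in the comments could look roughly like the sketch below. This is an illustration only (the kernel name add_kernel_atomic is made up here), and it does not handle the ordered "over/under" blending described in the last comment; it drops the mutex array entirely and lets the hardware atomic resolve conflicting writes.

    __global__ void add_kernel_atomic(int* matrix, int* indices, int nof_indices)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x; // global thread id
        if (index < nof_indices) {
            // conflicting updates to the same pixel are serialized by the atomic,
            // so no per-element mutex is needed for a plain increment
            atomicAdd(&matrix[indices[index]], 1);
        }
    }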

1 Answer


In the pre-Volta execution model, this line of code is/would have been problematic:

    while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);

The topic is addressed generally in this blog "Independent Thread Scheduling" and also in various SO questions such as this one and this one.
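The workaround typically suggested in those discussions for pre-Volta semantics is to attempt the lock, do the protected work, and release the lock all within the same divergent branch, so no thread spins in one branch while the lock-holding lane of its warp is starved in the other. A rough, untested sketch of the question's kernel restructured that way (same names as in the question):

    __global__ void add_kernel(int* matrix, int* indices, int* d_semaphores, int nof_indices)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < nof_indices) {
            int ind = indices[index];
            bool done = false;
            while (!done) {
                // acquire, update and release inside a single branch of the loop
                if (atomicCAS(&d_semaphores[ind], 0, 1) == 0) {
                    matrix[ind] += 1;                  // critical section
                    __threadfence();                   // make the update visible before releasing
                    atomicExch(&d_semaphores[ind], 0); // release the per-element mutex
                    done = true;
                }
            }
        }
    }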

However, as indicated in the blog (and elsewhere), the Volta execution model should allow more flexible paradigms. I believe the problem here arises due to a feature of nvcc:

To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal’s thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.

If you compile for a pre-Volta architecture, you are indicating to the compiler that you want pre-Volta semantics. This can affect the execution behavior of your code, for example when you are running on a Volta or newer GPU but compiling for a pre-Volta target.

According to my testing, the code deadlocks on sm_75 if I compile with default switches on CUDA 12.1, which by default selects an sm_52 target (including PTX). However, if I compile for an sm_75 target, the code runs "normally".

I think your code will not deadlock on your RTX 3060 if you compile for a Volta or newer target. Unless you have a reason not to, the general recommendation is to compile for the target(s) you wish to run on.
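For example, on an RTX 3060 (compute capability 8.6) that could mean passing an explicit architecture flag to nvcc instead of relying on the default. The file name add.cu below is just an assumption for illustration:

    # default target (sm_52 on CUDA 12.1): pre-Volta semantics, reproduced the hang in my testing
    nvcc -o add add.cu

    # explicit Ampere target for an RTX 3060: Volta-style independent thread scheduling
    nvcc -arch=sm_86 -o add add.cu

    # or let nvcc target the GPU present in the build machine (recent CUDA versions)
    nvcc -arch=native -o add add.cu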
