CUDA kernel behavior depending on the number of blocks and threads even if not used

Question

I've got the following kernel:

__global__ void myKernel(int k, int inc, int width, int* d_Xco, int* d_Xnum, bool* 
        Xvalid, float* d_X)
    {

        int i, k1;  
        i = threadIdx.x + blockIdx.x * blockDim.x;
        //k1 = threadIdx.y + blockIdx.y * blockDim.y;

        if( (i < k)  ){
           for(k1 = 0; k1 < inc; k1++){

             int mul = (d_X[i*inc + k1] >= 2e2);
             d_X[i*inc + k1] *= (float)(!mul);
             d_Xco[i*width + k1] = k*mul;
             d_Xnum[i] += mul;
             d_Xvalid[i*inc + k1] = (!mul) ; 

            }
         }// of if

 }

which is call this way:

  int bx = (int)(k/32)+1;
  int by = (int)(inc/32)+1;

  dim3 b(bDim, 1);
  dim3 t(tDim, 1);
  cmyKernel<< b, t >>>( k, inc, width, d_Xco, d_Xnum, d_Xvalid, d_X );

  cudaThreadSynchronize();

k is around 9000 and inc is around 5000, so I am sure I am not exceeding the number of blocks. If myKernel is called with 1thread/1block in the y dimension, the kernel seems to work fine, however, just changing the number of threads and blocks in y dimension to 10 for example, it gives wrong output, even if within the kernel I am not really using threads and blocks in y. Ideally, I would like to get rid of the for() using k = threadIdx.y + blockIdx.y * blockDim.y

If you launch a kernel with y dimension = 10 than you are using them. Only because you are not using the thread identifier threadIdx.y and blockIdx.y does not mean that the threads are not launched. When you launch a kernel with y dimension = 10 you will have 10 threads with i = 0, 10 threads with i = 1 etc. — brano, Oct 02 '12 at 10:51
@ brano: so when I do `dim3 t(tDim, 10)` which part of my kernel can be corrupted by that? I am sorry, i don't get it. I understand that threads will be launched, but if I am not using them, why is the reason my kernel gets corrupted? — Manolete, Oct 02 '12 at 10:54
Say you launch 2x2 threads for simplicity. You will have threads (0,0)(0,1) (1,0) (1,1). In your code the i variable is 0 for two threads (0,0) and (0,1) but the threadIdx.y is different. This means that both threads will evaluate the code for the same i variable and cause race conditions. — brano, Oct 02 '12 at 12:04
and what can I do to solve it? I want to use threads also for `k1` as I did with `i`. I thought I could do it launching threads in `x` and `y` dimension.. — Manolete, Oct 02 '12 at 12:50
You could do that but you need to resolve the dependency between iterations (d_Xnum[i] += mul). One way to do that is to use an atomicAdd(..). Uncomment the k1, replace the loop with if(k1 < inc) and add the atomicAdd. That should give you the correct behavior. — brano, Oct 02 '12 at 13:14
I don't see race condition on `d_Xnum[i] += mul` as 2 threads won't be accessing at the same time — Manolete, Oct 02 '12 at 13:44
The problem is I need to use `dim3 threads(32,32)` if I use more, it breaks... — Manolete, Oct 02 '12 at 14:30
You are using more threads than allowed (per block). Please post a new question. Your question has already been answered for this post. — brano, Oct 02 '12 at 14:37
I made an answer so feel free to accept it if you think it was correct. — brano, Oct 02 '12 at 14:53

score 2 · Accepted Answer · answered Oct 02 '12 at 14:52

If you launch a kernel with y dimension = 10 than you are using them. Only because you are not using the thread identifier threadIdx.y and blockIdx.y does not mean that the threads are not launched. When you launch a kernel with y dimension = 10 you will have 10 threads with i = 0, 10 threads with i = 1 etc.

Say you launch 2x2 threads for simplicity. You will have threads (0,0)(0,1) (1,0) (1,1). In your code the i variable is 0 for two threads (0,0) and (0,1) but the threadIdx.y is different. This means that both threads will evaluate the code for the same i variable and cause race conditions.

You need to resolve the dependency between iterations (d_Xnum[i] += mul). One way to do that is to use an atomicAdd(..). Uncomment the k1, replace the loop with if(k1 < inc) and add the atomicAdd. That should give you the correct behavior.

score 1 · Answer 2 · edited May 23 '17 at 12:18

As has been already stated in the comments, your current solution is launching several threads that will each apply their work to the same space of memory. This is due to several threads being spawned that all have a threadIdx.x of the same value, while the threadIdx.y value differs. This means that you will have several threads reading and writing to the same space of memory at the same time, this has many potential problems, here is a brief description.

To avoid this, there are several steps you may take. You can, for instance, use synchronised data access (which will cause a massive slowdown, as threads wait for others to finish data access). If you want to have each thread deal with one cell element, you need to remove the for loop and instead use k1 as before, however you must carefully consider the memory reads and writes as any part of the process may be executed before or after any other in a different thread!

The core understanding here is that you can never rely on the sequence of operations between threads!

When it comes to data access, it helps to think of all your data structures as grids, where each thread should only access and modify data in it's own coordinate, for instance (3,2) for a thread with threadIdx.x == 3 and threadIdx.y == 2. This way, you can easily visualize the behaviour of threads and potential race conditions. The easiest way to use this is to create one grid entry for each element of your output data, so if you have a matrix of 9000x5000 elements, you could potentially spawn that amount of threads to start with, and optimise from there. This will of course cause the GPU to have to execute on all it's units several times, but it is a good starting point.

The University In Oslo has a graduate level course on this topic, among other things. You might find these slides highly relevant to further your understanding. See especially the section regarding thread batching, grids and blocks.

CUDA kernel behavior depending on the number of blocks and threads even if not used

2 Answers2