
I'm using CUDA for the iterative Karatsuba algorithm, and I would like to ask why one line always computes a different result.

First, I implemented this function, which always computed the result correctly:

__global__ void kernel_res_main(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    // one thread per output coefficient i
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if( i > 0 && i < resultSize - 1){

        // bounds of the inner summation index for this coefficient
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) / 2;

        // serial accumulation over the index pairs (inner, i - inner)
        for(TYPE inner = start; inner < end; inner++){
            result[i] += ( A[inner] + A[i - inner] ) * ( B[inner] + B[i - inner] );
            result[i] -= ( D[inner] + D[i-inner] );
        }
    }
}
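
For reference, this is the serial loop that the kernel parallelizes over i (a sketch, with TYPE written as int for illustration; it accumulates into result[] exactly as the kernel does):

// Serial equivalent of kernel_res_main (sketch; TYPE shown as int).
for (int i = 1; i < resultSize - 1; i++) {
    int start = (i >= size) ? (i % size) + 1 : 0;  // lower bound of inner index
    int end   = (i + 1) / 2;                       // half-open upper bound

    for (int inner = start; inner < end; inner++) {
        result[i] += (A[inner] + A[i - inner]) * (B[inner] + B[i - inner]);
        result[i] -= (D[inner] + D[i - inner]);
    }
}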

Now I would like to use a 2D grid and let CUDA handle the for-loop, so I changed my function to this:

__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    TYPE rtmp = result[i];

    if( i > 0 && i < resultSize - 1){

        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;

        if(j >= start && j <= end ){

           // WRONG 
           rtmp += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
        }
    }

    result[i] = rtmp;
}

I am calling this function like this:

dim3 block( 32, 8 );
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);

And the result is always wrong and always different. I can't understand why the second implementation is wrong and always computes wrong results. I can't see any logical problem there connected with data dependency. Does anyone know how I can solve this problem?

Silicomancer
  • I have never worked in CUDA. That being said, if the answer is always different, the problem is likely a data dependency. I would probably have to scratch my head to figure it out. You might have better luck. – Aniket Chowdhury Apr 25 '18 at 20:27
  • Understanding the CUDA in this code is simple: all you need to do is replace the _for loops_ with the _if conditions_ and run the code with a lot of threads. But thank you, I will try to figure out why there is a data dependency. – Apr 25 '18 at 20:33

1 Answer


For questions like this, you are supposed to provide an MCVE (see item 1 here). For example, I don't know what type is indicated by TYPE, and it does matter for the correctness of the solution I will propose.

In your first kernel, only one thread in your entire grid was reading and writing location result[i]. But in your second kernel, you now have multiple threads writing to the result[i] location. They conflict with each other. CUDA doesn't specify the order in which threads will run; some may run before, after, or at the same time as others. In this case, some threads may read result[i] at the same time as others. Then, when the threads write their results, those results will be inconsistent, and they may vary from run to run. You have a race condition there (an execution-order dependency, not a data dependency).
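
If you want to see the effect in isolation, a toy program like the one below (my own made-up example, separate from your code) demonstrates how an unprotected read-modify-write loses updates while atomicAdd does not:

#include <cstdio>

// Toy illustration of the read-modify-write race (not the question's kernel).
__global__ void racy_add(int *counter) { *counter += 1; }          // lost updates likely
__global__ void safe_add(int *counter) { atomicAdd(counter, 1); }  // always correct

int main() {
    int *d_counter, h_counter = 0;
    cudaMalloc(&d_counter, sizeof(int));

    cudaMemset(d_counter, 0, sizeof(int));
    racy_add<<<256, 256>>>(d_counter);
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("non-atomic: %d (expected 65536)\n", h_counter);

    cudaMemset(d_counter, 0, sizeof(int));
    safe_add<<<256, 256>>>(d_counter);
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic:     %d (expected 65536)\n", h_counter);

    cudaFree(d_counter);
    return 0;
}

On most GPUs the first printout will be far below 65536 and will change from run to run, which is the same symptom you are seeing.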

The canonical method to sort this out would be to employ a reduction technique.

However, for simplicity, I will suggest that atomics can help you sort it out. This is easier to implement based on what you have shown, and it will help confirm the race condition. After that, if you want to try a reduction method, there are plenty of tutorials for that (one is linked above) and plenty of questions here on the cuda tag about it; a rough sketch is also included further below.

You could modify your kernel to something like this, to sort out the race condition:

__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    if( i > 0 && i < resultSize - 1){

        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;

        if(j >= start && j < end ){ // see note below

           atomicAdd(result+i, (( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] )));
        }
    }

}

Note that depending on your GPU type, and the actual type of TYPE you are using, this may not work (may not compile) as-is. But since you had previously used TYPE as a loop variable, I am assuming it is an integer type, and the necessary atomicAdd for those should be available.
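
For completeness, a reduction-style version might look roughly like the sketch below. It is untested and makes assumptions of my own: TYPE is an integer type, one block is launched per output coefficient with a power-of-two blockDim.x, and the shared memory size is passed at launch, e.g. kernel_res_reduce<<<resultSize, 128, 128*sizeof(TYPE)>>>(devA, devB, devD, devResult, size, resultSize);

__global__ void kernel_res_reduce(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){

    extern __shared__ TYPE sdata[];

    int i = blockIdx.x;                         // one block per output coefficient
    if( i <= 0 || i >= resultSize - 1) return;  // whole block exits together

    TYPE start = (i >= size) ? (i % size ) + 1 : 0;
    TYPE end = (i + 1) >> 1;

    // each thread accumulates a partial sum over a strided range of j
    TYPE partial = 0;
    for(TYPE j = start + threadIdx.x; j < end; j += blockDim.x){
        partial += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
    }

    sdata[threadIdx.x] = partial;
    __syncthreads();

    // standard shared-memory tree reduction (requires power-of-two blockDim.x)
    for(int s = blockDim.x / 2; s > 0; s >>= 1){
        if(threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if(threadIdx.x == 0) result[i] += sdata[0];
}

Because only thread 0 of block i ever touches result[i], no atomics are needed in this arrangement.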

A few other comments:

  1. This may not be giving you the grid size you expect:

    dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
    

    I think the usual calculations there would be (there is a small worked example after these notes):

    dim3 grid( (resultSize+31)/32, (resultSize+7)/8 );
    
  2. I always recommend proper CUDA error checking and running your codes with cuda-memcheck any time you are having trouble with a CUDA code, to make sure there are no runtime errors (a minimal error-checking sketch is included after these notes).

  3. It also looks to me like this:

    if(j >= start && j <= end ){
    

    should be this:

    if(j >= start && j < end ){
    

    to match your for-loop range. I am also assuming that size is less than resultSize (again, an MCVE would help).
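
To make note 1 concrete, here is the arithmetic for a hypothetical resultSize of 100 (the number is only for illustration): the original expression launches many more blocks than intended, because 1/32 and 7/8 are integer divisions that evaluate to 0.

// What the expressions evaluate to for a hypothetical resultSize of 100:
//   resultSize + 1/32     ->  100 + 0  = 100   (1/32 is integer division, i.e. 0, so this is just resultSize)
//   (resultSize + 31)/32  ->  131/32   = 4     (ceiling division: 4 x-blocks of 32 threads cover 100)
//   (resultSize + 7)/8    ->  107/8    = 13    (ceiling division: 13 y-blocks of 8 threads cover 100)
dim3 block( 32, 8 );
dim3 grid( (resultSize + block.x - 1) / block.x,    // generic ceiling division,
           (resultSize + block.y - 1) / block.y );  // independent of the block shape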
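
And for note 2, one minimal error-checking pattern looks like this (a sketch; the macro name is my own choice):

#include <cstdio>
#include <cstdlib>

// Wrap every runtime API call, and check kernel launches separately.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err__), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage sketch:
//   CUDA_CHECK(cudaMalloc(&devResult, resultSize * sizeof(TYPE)));
//   kernel_res_nested<<<grid, block>>>(devA, devB, devD, devResult, size, resultSize);
//   CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // asynchronous execution errors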

Robert Crovella
  • Now I can see the result is always the same, but still wrong. Are you able to figure out why? The _TYPE_ is just _int_, and I am pretty sure that in this case this type is OK. I believe the problem is in my grid dimensions and block numbers. It seems to me that my code now executes twice for every even coefficient of the polynomial. – Apr 25 '18 at 22:00
  • Take a look at my notes 1 and 3 in the answer which I have edited. If those don't address the issue then I would need to see a [mcve] – Robert Crovella Apr 25 '18 at 22:13