I'm using CUDA for the iterative Karatsuba algorithm and I would like to ask, why is one line computed always different.
First, I implemented this function, which computed the result always correctly:
__global__ void kernel_res_main(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
int i = blockDim.x * blockIdx.x + threadIdx.x;
if( i > 0 && i < resultSize - 1){
TYPE start = (i >= size) ? (i % size ) + 1 : 0;
TYPE end = (i + 1) / 2;
for(TYPE inner = start; inner < end; inner++){
result[i] += ( A[inner] + A[i - inner] ) * ( B[inner] + B[i - inner] );
result[i] -= ( D[inner] + D[i-inner] );
}
}
}
Now I would like to use the 2D grid and use CUDA for the for-loop, so I changed my function to this:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
TYPE rtmp = result[i];
if( i > 0 && i < resultSize - 1){
TYPE start = (i >= size) ? (i % size ) + 1 : 0;
TYPE end = (i + 1) >> 1;
if(j >= start && j <= end ){
// WRONG
rtmp += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
}
}
result[i] = rtmp;
}
I am calling this function like this:
dim3 block( 32, 8 );
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);
And the result is alway wrong and always different. I can't understand why is that second implementation wrong and always computes wrong results. I can't see there any logical problem connected with data dependency. Does anyone know How can I solve this problem?