I am a beginner in CUDA, trying to compute the sum of the elements that reside in shared memory. The following is my kernel:
__global__ void columnaddition(DataIn data, double** shiftedData)
{
    int u  = blockIdx.y;
    int v  = threadIdx.x;
    int xu = blockIdx.x;

    extern __shared__ double columnShiftedData[];

    // Each thread loads one element into shared memory.
    columnShiftedData[v] = *(*(shiftedData + (v * data.V) + u) + xu);
}
Here, based on threadIdx.x, blockIdx.x, and blockIdx.y, I load the data into shared memory. To my understanding, only one thread can be involved in computing the sum, since summation must be sequential. However, the kernel is launched with V (assumed) threads per block, so my questions are: how can I compute the sum efficiently, and what happens to the other threads in the block?
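For context on what I have tried to understand so far: a minimal sketch of the standard tree-based shared-memory reduction (the kernel name `blockSum` and the arrays `in` and `out` are hypothetical, and it assumes `blockDim.x` is a power of two). The threads that do not hold a partial sum in a given step stay idle for that step, but they must still reach the barrier:

```cuda
// Sketch: each block sums blockDim.x consecutive doubles from "in"
// and writes one partial sum per block into "out".
// Launch as: blockSum<<<numBlocks, V, V * sizeof(double)>>>(in, out);
__global__ void blockSum(const double* in, double* out)
{
    extern __shared__ double s[];
    int v = threadIdx.x;

    // Phase 1: every thread loads one element into shared memory.
    s[v] = in[blockIdx.x * blockDim.x + v];
    __syncthreads();  // all loads must complete before anyone reads s[]

    // Phase 2: tree reduction. Each step halves the number of active
    // threads; the inactive ones do nothing but still hit the barrier,
    // because __syncthreads() must be reached by the whole block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (v < stride)
            s[v] += s[v + stride];
        __syncthreads();
    }

    // Phase 3: thread 0 holds the block's total in s[0].
    if (v == 0)
        out[blockIdx.x] = s[0];
}
```

This takes log2(V) steps instead of V sequential additions. The key point for the second question is that the "other" threads do not exit early: every thread in the block executes the loop, and the `if (v < stride)` guard simply makes the upper half skip the addition each round.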