
I am a beginner to CUDA and am trying to get the sum of the elements that reside in shared memory. The following is my kernel:

__global__ void columnaddition(DataIn data, double** shiftedData)
{
    int u = blockIdx.y;
    int v = threadIdx.x;
    int xu = blockIdx.x;
    extern __shared__ double columnShiftedData[];
    // Each thread copies one element from global memory into shared memory.
    columnShiftedData[v] = *(*(shiftedData + (v*data.V) + u) + xu);
}

Here, based on threadIdx.x, blockIdx.x, and blockIdx.y, I load the data into shared memory. To my understanding, only one thread can be involved in computing the sum, since it should be sequential. However, the kernel is launched with V (assumed) threads, so my question is: how can I get the sum efficiently, and what happens to the other threads inside the block?

paleonix
Aliya Clark
  • The algorithm you need is called a shared memory reduction, and it is extremely well described. If you have installed the CUDA samples, you have a complete working solution and a paper discussing both theory and implementation. – talonmies Nov 27 '16 at 14:00
  • The reduction tutorial is [here](http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf). – Robert Crovella Nov 27 '16 at 15:56
  • I went through the reduction tutorial [here](https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/). It covers arrays whose size is the warp size or a multiple of the warp size. – Aliya Clark Nov 27 '16 at 17:29
  • 2
    And that tutorial is really not about shared memory reduction or even reducing the values in shared memory. For the block reduce portion, shared memory is being used to hold the warp sums temporarily, but the bulk of the reduction work is being done using warp-shuffle, which *does not use* shared memory. I'd suggest you read the tutorial I linked if you want to learn about summing elements in shared memory. – Robert Crovella Nov 27 '16 at 21:33
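The comments point at the classic shared-memory tree reduction. A minimal sketch of how it could be grafted onto the kernel above: after every thread has loaded its element, the block synchronizes, then the number of active threads is halved each step while idle threads simply fall through. The output pointer `out`, its indexing, and the assumption that V is a power of two are mine for illustration, not part of the question's code.

```cuda
__global__ void columnaddition(DataIn data, double** shiftedData,
                               double* out /* hypothetical result buffer */)
{
    int u = blockIdx.y;
    int v = threadIdx.x;
    int xu = blockIdx.x;
    extern __shared__ double columnShiftedData[];
    columnShiftedData[v] = *(*(shiftedData + (v*data.V) + u) + xu);
    __syncthreads();                      // all loads must finish before summing

    // Tree reduction over the V values in shared memory.
    // Assumes blockDim.x == V and V is a power of two, for simplicity.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (v < stride)
            columnShiftedData[v] += columnShiftedData[v + stride];
        __syncthreads();                  // finish this level before the next
    }

    if (v == 0)                           // thread 0 now holds the block's sum
        out[blockIdx.x + gridDim.x * blockIdx.y] = columnShiftedData[0];
}
```

This also answers the second part of the question: the "other" threads are not wasted at the start (each level of the tree uses half of them), and once `v >= stride` they just skip the addition and wait at the barrier. If V is not a power of two, the usual trick is to pad the shared array up to the next power of two with zeros before the loop.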

0 Answers