
I wrote CUDA code to compute the dot product of two double vectors. The kernel is launched with N threads (N < 1024) in a single block, but it doesn't give correct results. I can't figure out why.

__global__ void dotthread(double *a, double *b, double *sum, int N)
{
    int tid = threadIdx.x;
    sum[0] = sum[0] + a[tid] * b[tid];  // every thread writes to sum[0]
    __syncthreads();
}
ztdep
  • "every thread write to the sum[0] " is the problem. – talonmies May 28 '18 at 04:16
  • I have used __syncthreads() to avoid conflicts between these threads. – ztdep May 28 '18 at 04:20
  • That isn't what syncthreads does. There is no way of avoiding the memory race in that code. Either use an atomic memory transaction or a shared memory reduction – talonmies May 28 '18 at 04:23
  • I am really confused. __syncthreads() synchronizes all the threads in a block. Why can't it be used in this situation? – ztdep May 28 '18 at 04:26
  • The marked duplicate has a working version of a dot product kernel. study it carefully. when you understand how it works you will understand why your version doesn't work. – talonmies May 28 '18 at 04:38
  • I implemented the dot product in a different way, so these are two different questions. Why downvote my reputation? – ztdep May 28 '18 at 05:09
  • You mean "I implemented the dot product in a way which can never work correctly" . There is nothing to fix in your code. It will never work. The other question has a correct (and basically only) way of doing a block wise dot product – talonmies May 28 '18 at 05:13
  • At the minimum, very much like plain CPU multithreaded code, you should use an atomicAdd, as your sum is not atomic. There are many ways of performing a dot product depending on the size of your data, whether your device is busy doing something else, etc. CUB is probably a very good source. Since CC 3.0, shared memory may not be necessary to get excellent performance; have a look at the shuffle presentation at GTC. – Florent DUGUET May 28 '18 at 09:16

2 Answers


Let's look at two of your three lines of code:

    sum[0] = sum[0]+a[tid] * b[tid];  //every thread write to the sum[0] 
    __syncthreads();  

The first line contains a memory race. Every thread in the block will simultaneously attempt to write to sum[0]. There is nothing in the CUDA execution model which can stop this from happening: there is no automatic serialization or memory protection to prevent this behaviour.

The second line is an execution barrier. This means that each warp of threads will be blocked until every warp in the block has reached the barrier. It has no effect whatsoever on prior instructions, and no effect on memory consistency or on the behaviour of any memory transactions which your code issues.

The code you have written is irreversibly broken. The canonical way to perform this sort of operation is a parallel reduction. There are a number of different ways this can be done; it is probably the most described and documented parallel algorithm for GPUs. If you have installed the CUDA toolkit, you already have both a complete working example and a comprehensive paper describing the algorithm as it would be implemented using shared memory. I suggest you study it.

You can see an (almost) working implementation of a dot product using shared memory here, which I recommend you study as well. You can also find optimal implementations of the parallel block reduction in libraries such as CUB.
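
For reference, a minimal sketch of the kind of shared-memory reduction described above might look like the following for the single-block, N < 1024 case from the question. This is not the answer's code: the kernel name, the fixed 1024-slot buffer, and the requirement that blockDim.x be a power of two are illustrative assumptions.

__global__ void dotreduce(const double *a, const double *b, double *sum, int N)
{
    __shared__ double cache[1024];           // one slot per thread in the block (assumes blockDim.x <= 1024)
    int tid = threadIdx.x;

    // Each thread writes its own product to its private shared-memory slot.
    cache[tid] = (tid < N) ? a[tid] * b[tid] : 0.0;
    __syncthreads();                         // all products visible before reducing

    // Tree reduction: halve the number of active threads each step (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                     // wait for this reduction step to complete
    }

    if (tid == 0)
        sum[0] = cache[0];                   // only thread 0 writes the final result
}

For inputs larger than one block, each block would write its partial sum to a per-block output and a second step would combine those partials.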

talonmies

I wrote two versions of the dot product. The first uses the atomicAdd function; the second allocates one shared variable per block. Their computation times are 3.33 ms and 0.19 ms respectively, compared with 0.17 ms for the reduction dot product and 411.43 ms for a single-thread dot product.

GPU Device 0: "GeForce GTX 1070 Ti" with compute capability 6.1
2000000flow array allocated 2147483647
 naive elapsedTimeIn Ms 411.437042 Ms
sum is 6.2831853071e+06

thread atomic add elapsedTimeIn Ms 3.3371520042 Ms
sum is 6.2831853071e+06

 cache reduction elapsedTimeIn Ms 0.1764480025 Ms
sum is 6.2831853072e+06

 cache atomicadd elapsedTimeIn Ms 0.1914239973 Ms
sum is 6.2831853072e+06

__global__ void dotatomicAdd(double *a, double *b, double *sum, int N)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (tid < N){                    // grid-stride loop
        double t = a[tid] * b[tid];
        atomicAdd(sum, t);              // every update to the global accumulator is serialized
        tid += blockDim.x * gridDim.x;
    }
}


__global__ void dotcache(double *a, double *b, double *c, int N)
{
    __shared__ double cache;            // one shared accumulator per block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;
    double temp = 0.0;

    cache = 0.0;
    __syncthreads();                    // ensure cache is zeroed before any atomicAdd

    while (tid < N) {                   // grid-stride loop: per-thread partial sum in a register
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    atomicAdd(&cache, temp);            // atomics only contended within the block
    __syncthreads();                    // wait until every thread has contributed

    if (cacheIndex == 0) c[blockIdx.x] = cache;   // one partial sum per block
}
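
A minimal host-side sketch of how these kernels might be driven (the device pointers d_a, d_b, d_sum, d_c, the host buffer h_c, and the launch configuration are illustrative assumptions, not part of the answer): the atomicAdd version leaves the finished result in *sum, while dotcache leaves one partial sum per block that still has to be added up.

int THREADS = 256, BLOCKS = 64;

// atomicAdd version: zero the accumulator first; the result is complete on the device.
cudaMemset(d_sum, 0, sizeof(double));
dotatomicAdd<<<BLOCKS, THREADS>>>(d_a, d_b, d_sum, N);

// cache version: each block writes one partial sum to d_c[blockIdx.x],
// so the host (or another kernel) must add the BLOCKS partials afterwards.
dotcache<<<BLOCKS, THREADS>>>(d_a, d_b, d_c, N);
cudaMemcpy(h_c, d_c, BLOCKS * sizeof(double), cudaMemcpyDeviceToHost);
double sum = 0.0;
for (int i = 0; i < BLOCKS; ++i) sum += h_c[i];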
ztdep
  • You should be aware that double precision atomic operations are only supported on compute capability 6.x and newer hardware, so this code won't work on the majority of CUDA hardware in the wild. Also, for your benchmarks, what are the size and number of the vectors? The reference block dot product I use takes 4 *microseconds* for the 1024 element case you showed in the original question. – talonmies May 29 '18 at 09:05
  • Dear sir: I use CUDA 9.2 on the 1070 Ti platform, and it supports atomicAdd(double*, double). GPU Device 0: "GeForce GTX 1070 Ti" with compute capability 6.1 2000000flow array allocated – ztdep May 29 '18 at 10:59