
This is my sequential code:

float foo(float* in1, float* in2, float in3, unsigned int size) {
    float tmp = 0.f;
    for (int i = 0; i < size; i++)
        if (in2[i] > 0) tmp += (in1[i]/in3 - (in2[i]/in3)*(in2[i]/in3));
    return tmp;
}

This is my effort to port it to CUDA:

__global__ void kernel_foo(float* tmp, const float* in1, const float* in2,
                           float in3, unsigned int size) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {
        if (in2[i] > 0) {
            atomicAdd(tmp, in1[i]/in3 - (in2[i]/in3)*(in2[i]/in3));
        }
    }
}

void launch_kernel_foo(float* tmp, const float* in1, const float* in2,
                       float in3,  unsigned int size) {
  kernel_foo<<<(size+255)/256,256>>>(tmp, in1, in2, in3, size);
}

However, it does not produce correct results. Could anyone tell me where the mistake is?

talonmies
R.GH
  • What doesn't work? What is the expected result and what is the current result? Have you initialized the device variable `tmp` to zero before launching the kernel? – sgarizvi Feb 16 '16 at 08:55
  • It does run, but the result is wrong. No, I have not initialized the device variable tmp. How could I initialize it to zero? – R.GH Feb 16 '16 at 08:58
  • But you have written in the question that **it doesn't work**. Initialize `tmp` to `0` before launching the kernel using `cudaMemset(tmp, 0, sizeof(float));`. – sgarizvi Feb 16 '16 at 09:00
  • Excuse me, I had forgotten that; I have edited the question now. Yes, I must use cudaMemset. Could you write an example of how to use cudaMemset? – R.GH Feb 16 '16 at 09:04

1 Answer


The result is incorrect because the output variable tmp is not initialized before the kernel is launched. Since the kernel accumulates into tmp with atomicAdd, tmp must be set to 0 before the launch. This can be done as follows:

void launch_kernel_foo(float* tmp, const float* in1, const float* in2,
                       float in3,  unsigned int size) {
  cudaMemset(tmp, 0, sizeof(float));
  kernel_foo<<<(size+255)/256,256>>>(tmp, in1, in2, in3, size);
} 
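
For completeness, here is a minimal end-to-end sketch of how the zeroed accumulator could fit into a full host-side call. The wrapper name foo_gpu and the use of std::vector for the inputs are my own assumptions, not part of the question; it also assumes the inputs are non-empty and of equal length, and it skips error checking for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the question (same arithmetic, conditions merged).
__global__ void kernel_foo(float* tmp, const float* in1, const float* in2,
                           float in3, unsigned int size) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size && in2[i] > 0) {
        float r = in2[i] / in3;
        atomicAdd(tmp, in1[i] / in3 - r * r);
    }
}

// Hypothetical host wrapper (name and signature are assumptions).
// Assumes in1 and in2 are non-empty and have the same length.
float foo_gpu(const std::vector<float>& in1, const std::vector<float>& in2,
              float in3) {
    unsigned int size = static_cast<unsigned int>(in1.size());
    std::size_t bytes = size * sizeof(float);

    float *d_in1 = nullptr, *d_in2 = nullptr, *d_tmp = nullptr;
    cudaMalloc(&d_in1, bytes);
    cudaMalloc(&d_in2, bytes);
    cudaMalloc(&d_tmp, sizeof(float));

    cudaMemcpy(d_in1, in1.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, in2.data(), bytes, cudaMemcpyHostToDevice);

    // cudaMemset writes a byte value; all-zero bytes are the float 0.0f.
    cudaMemset(d_tmp, 0, sizeof(float));

    kernel_foo<<<(size + 255) / 256, 256>>>(d_tmp, d_in1, d_in2, in3, size);

    // This blocking copy waits for the kernel to finish before reading.
    float result = 0.f;
    cudaMemcpy(&result, d_tmp, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in1);
    cudaFree(d_in2);
    cudaFree(d_tmp);
    return result;
}
```

Note that cudaMemset fills bytes, so it is suitable for zeroing a float but not for setting it to an arbitrary value such as 1.0f. Also, atomicAdd accumulates in an unspecified order, so for large inputs the result may differ from the sequential loop in the last few bits.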

It is highly recommended to add CUDA error checking to your code, as described in this post.
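
As a rough illustration of what such error checking might look like, here is one possible sketch. The macro name CUDA_CHECK is hypothetical (it is not a CUDA API, and the linked post may use a different form); the runtime calls it wraps are the standard ones.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: aborts with a message if a runtime call fails.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Kernel from the question.
__global__ void kernel_foo(float* tmp, const float* in1, const float* in2,
                           float in3, unsigned int size) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {
        if (in2[i] > 0) {
            atomicAdd(tmp, in1[i]/in3 - (in2[i]/in3)*(in2[i]/in3));
        }
    }
}

void launch_kernel_foo(float* tmp, const float* in1, const float* in2,
                       float in3, unsigned int size) {
    CUDA_CHECK(cudaMemset(tmp, 0, sizeof(float)));
    kernel_foo<<<(size + 255) / 256, 256>>>(tmp, in1, in2, in3, size);
    CUDA_CHECK(cudaGetLastError());       // reports invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran
}
```

The cudaDeviceSynchronize call is there only so that execution errors surface at the launch site; in performance-sensitive code you may prefer to check at a later synchronization point instead.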

Community
sgarizvi
  • @R.GH: It is a simple proposition. Provide a complete MCVE and someone might be able to give you a definite answer to your question. If you don't, the amount of help you can realistically expect to receive here is limited. It is completely your choice. – talonmies Feb 16 '16 at 16:02