I am new to CUDA. I was using CUDA to compute the dot product of float vectors and came across a floating-point addition issue. In essence, the following is the simple kernel (I am compiling with -arch=sm_50). The basic idea is for thread 0 to add up the values of vector a:
__global__ void temp(float *a, float *b, float *c) {
    // Only thread 0 of block (0,0) does the summation.
    if (threadIdx.x == 0 && blockIdx.x == 0 && blockIdx.y == 0) {
        float xx = 0.0f;
        for (int i = 0; i < LENGTH; i++) {
            xx += a[i];  // sequential accumulation in a single float
        }
        *c = xx;
    }
}
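For completeness, the launch looks something like this (illustrative only; d_a, d_b, d_c and result are placeholder names, and since only one thread does any work the launch configuration shouldn't matter):

    // Illustrative launch; only thread 0 of block (0,0) actually sums.
    temp<<<1, 32>>>(d_a, d_b, d_c);
    cudaMemcpy(&result, d_c, sizeof(float), cudaMemcpyDeviceToHost);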
When I initialize a with 1000 elements of 1.0f, I get the expected result of 1000.00. But when I initialize it with 1.1f, I expect roughly 1100.00; instead I get 1099.989014. The CPU implementation yields 1100.000024.
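The CPU reference I am comparing against is essentially the same loop on the host (a minimal sketch; sumCpu and the plain float accumulator are illustrative, not my exact code):

    // Host-side reference sum (illustrative sketch).
    // Accumulates n floats sequentially, same order as the kernel.
    float sumCpu(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }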
I am trying to understand what the issue is here! :-(
I even counted the number of 1.1 elements in the a vector on the device, and that yields 1000 as expected. I also tried atomicAdd, and I still get the same result.
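The atomicAdd variant I tried was roughly this (a sketch; tempAtomic is an illustrative name, and it assumes *c is zeroed before launch):

    // Parallel variant: every thread atomically adds one element into *c.
    // Assumes *c has been set to 0.0f before the kernel launch.
    __global__ void tempAtomic(float *a, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < LENGTH) {
            atomicAdd(c, a[i]);
        }
    }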
I would be very grateful if someone could help me out here!

Best
EDIT: My biggest concern here is the disparity between the CPU and GPU results. I understand floats can be off by a few decimal places, but the GPU error looks very significant: 1.1f is actually about 1.10000002384185791, so the exact sum of 1000 copies is about 1100.0000238, which is essentially what the CPU reports, while the GPU result is off by roughly 0.011. :-(