Note that your values of x and y (expressed in decimal) are not exactly representable as binary (floating-point) numbers within the range of a double. The following program demonstrates this:
$ cat t556.cu
#include <stdio.h>
#include <math.h>
__global__ void mykernel(double x, double y, double *res){
*res = atan2(x,y);
}
int main(){
double h_x = -6.5799015957503127E+02;
double h_y = -4.6102597302044005E+03;
double *d_res;
double my_res = atan2(h_x, h_y);
cudaMalloc(&d_res, sizeof(double));
mykernel<<<1,1>>>(h_x, h_y, d_res);
double h_res;
cudaMemcpy(&h_res, d_res, sizeof(double), cudaMemcpyDeviceToHost);
printf("x = %.32lf\n", h_x);
printf("y = %.32lf\n\n", h_y);
printf("hst = %.32lf\n", my_res);
printf("dev = %.32lf\n\n", h_res);
printf("hst bin = 0x%llx\n", *(reinterpret_cast<unsigned long long *>(&my_res)));
printf("dev bin = 0x%llx\n", *(reinterpret_cast<unsigned long long *>(&h_res)));
return 0;
}
$ nvcc -arch=sm_20 -o t556 t556.cu
$ ./t556
x = -657.99015957503127083327854052186012
y = -4610.25973020440051186596974730491638
hst = -2.99982704867761151845684253203217
dev = -2.99982704867761107436763268196955
hst bin = 0xc007ffa552ddcff5
dev bin = 0xc007ffa552ddcff4
$
We see that when we specify x according to what you have written in your question, and then print it out with a large number of digits, the printout does not match the value "presumably" assigned by the code.
The above program also demonstrates that the results of atan2 computed on the host and on the device differ by 1 bit, in the least significant place of the mantissa of the result.
Referring to the CUDA math documentation in the programming guide, we see that the maximum error of the atan2 function is 2 ULP (ULP = units in the last place; for this discussion it is equivalent to the units represented by the least significant bit of the mantissa).
This means that the atan2 function (in CUDA device code) is not guaranteed to produce a correctly rounded result (i.e., the result that would be obtained by carrying full arbitrary precision throughout), and the result it produces may differ by 1 or 2 ULP from such a correctly rounded IEEE-754 implementation. Therefore, when comparing the CUDA implementation of atan2 to another implementation, it is reasonable to expect a difference of 1 or 2 mantissa LSBs in the results.
If you require the atan2 result computed by the CUDA device to perfectly match (with no mantissa bits differing) the atan2 result of another implementation, then the atan2 function provided by the CUDA math library will not be usable for you.
The only advice I can give you at that point would be to create your own implementation of atan2, using more basic floating-point operations throughout (presumably choosing ones that offer 0 ULP error in the CUDA math library, although I am not an expert on how this would be done in detail), together with numerical methods designed to match the implementation you are comparing against.
This may be an informative read as well. Note that the GNU implementations do not necessarily guarantee 0 ULP error across all math functions, or even all trig-type functions. For example, cos appears to have a maximum error of 1 ULP on IA64. However, atan2 appears to sit at the lowest error level listed.