Note that your values of x and y (expressed in decimal) are not exactly representable as binary (floating-point) numbers within the range of a double. The following program demonstrates this:
$ cat t556.cu
#include <stdio.h>
#include <math.h>
__global__ void mykernel(double x, double y, double *res){
*res = atan2(x,y);
}
int main(){
double h_x = -6.5799015957503127E+02;
double h_y = -4.6102597302044005E+03;
double *d_res;
double my_res = atan2(h_x, h_y);
cudaMalloc(&d_res, sizeof(double));
mykernel<<<1,1>>>(h_x, h_y, d_res);
double h_res;
cudaMemcpy(&h_res, d_res, sizeof(double), cudaMemcpyDeviceToHost);
printf("x = %.32lf\n", h_x);
printf("y = %.32lf\n\n", h_y);
printf("hst = %.32lf\n", my_res);
printf("dev = %.32lf\n\n", h_res);
printf("hst bin = 0x%llx\n", *(reinterpret_cast<unsigned long long *>(&my_res)));
printf("dev bin = 0x%llx\n", *(reinterpret_cast<unsigned long long *>(&h_res)));
return 0;
}
$ nvcc -arch=sm_20 -o t556 t556.cu
$ ./t556
x = -657.99015957503127083327854052186012
y = -4610.25973020440051186596974730491638
hst = -2.99982704867761151845684253203217
dev = -2.99982704867761107436763268196955
hst bin = 0xc007ffa552ddcff5
dev bin = 0xc007ffa552ddcff4
$
We see that when we specify x according to what you have written in your question, and then print it out with a large number of digits, the printout does not match the value "presumably" assigned by the code.
The above program also demonstrates that the results of atan2 computed on the host and on the device differ by 1 bit, in the least significant place of the mantissa of the result.
Referring to the CUDA math documentation in the programming guide, we see that the maximum error of the atan2 function is 2 ULP (ULP = units in the last place; for this discussion it is equivalent to the units represented by the least significant bit of the mantissa).
This means that the atan2 function (in CUDA device code) is not guaranteed to produce a correctly rounded result (i.e., the result that would be obtained by carrying full arbitrary precision throughout), and the result it produces may differ by 1 or 2 ULP from such a correctly rounded IEEE-754 implementation. Therefore, when comparing the CUDA implementation of atan2 to another implementation, it is reasonable to expect a difference of 1 or 2 mantissa LSBs in the results.
If you require the atan2 result computed by the CUDA device to perfectly match (with no mantissa bits differing) the atan2 result of another implementation, then the atan2 function provided by the CUDA math library will not be usable for you.
The only advice I can give you at that point would be to create your own implementation of atan2, using more basic floating-point operations throughout (presumably choosing ones that offer 0 ULP error in the CUDA math library, although I am not an expert on how this would be done in detail), together with numerical methods designed to match the implementation you are comparing against.
This may be an informative read as well. Note that the GNU implementations do not necessarily guarantee 0 ULP error across all math functions, or even all trig-type functions. For example, cos appears to have a maximum error of 1 ULP on IA64. However, atan2 appears to sit at the lowest error level listed.