I ran into a really weird problem. I summed all the elements of a matrix using thrust::reduce. It works for most data, but it goes wrong on one particular set.

code:

  lbfgsfloatval_t sum(const DeviceVector& A){
    thrust::device_ptr<lbfgsfloatval_t> ptr(A.getPtr());
    thrust::device_vector<double> A_p(ptr, ptr + A.rows() * A.cols());
    lbfgsfloatval_t sums = 0.0;

    // reduce on host
    for (int i = 0; i < A.rows() * A.cols(); i++)
        sums += A_p[i];
    // reduce on device
    lbfgsfloatval_t res = thrust::reduce(A_p.begin(), A_p.end());
    cout << "cpu: " << sums << endl;
    cout << "gpu: " << res  << endl;
    return res;
  }

Notice that the second group below is wrong.

output:

cpu: -568.691
gpu: -568.691

cpu: 3.4972e-14
gpu: 1.40998e-14

cpu: 0.234375
gpu: 0.234375

I also tried skipping the thrust::device_vector and using a raw pointer instead: same output. I also tried a cuBLAS dot product: same output.

I used MATLAB to confirm that the CPU result above is correct.

What happened? Was it underflow on the GPU? Thanks!

  • Just how wrong is the result? What is the maximum absolute value of an element in the second array? Maybe it's just caused by a different order of summation? – fjarri Sep 24 '13 at 12:35
  • The CPU result 3.4972e-14 is correct, as confirmed by MATLAB. The memory layout on the GPU and CPU is the same: column-major, unrolled as one long vector. So both sums run along the columns. – user2684645 Sep 24 '13 at 12:39
  • If `reduce` is implemented as I think it is, the order of summation depends on the order the blocks are scheduled to multiprocessors, which is random. MATLAB and your CPU code perform the summation sequentially. – fjarri Sep 24 '13 at 12:43
  • The order is random. Depending on the magnitudes of the values in your array this may or may not matter. E.g. 1e40 - 1e40 + 1 != 1e40 + 1 - 1e40 (see the sketch after these comments). – fjarri Sep 24 '13 at 12:53
  • I added -fmad=false but it still gave the same result. – user2684645 Sep 24 '13 at 12:53
  • @user2684645 Sorry, I misunderstood the question. You are making a reduction, while I was using FMADs, so I mixed up things. Of course, disabling FMAD has no effect since there is no FMAD in your code. Have a look at the 2nd answer of [Precision in Sum reduction kernel with floats](http://stackoverflow.com/questions/15139656/precision-in-sum-reduction-kernel-with-floats). It quotes the paper [A Comparison Of Methods For Accurate Summation](http://www.sigsam.org/bulletin/articles/147/sumnums.pdf) which points out how "the summation of large sets of numbers is prone to serious rounding errors". – Vitality Sep 24 '13 at 13:17
  • What GPU architecture (compute capability) are you compiling for and running on? – Robert Crovella Sep 24 '13 at 15:52
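
As fjarri's comments point out, floating-point addition is not associative, so the order in which a reduction combines terms changes the rounding. Here is a minimal host-side sketch of his 1e40 example (plain C++, illustrative only, not from the thread):

    #include <iostream>
    using namespace std;

    int main() {
        // Same three terms, two summation orders. In exact arithmetic both
        // sums equal 1, but in double precision adding 1 to 1e40 loses the
        // 1 entirely (it is far below one ulp of 1e40).
        double big = 1e40, one = 1.0;

        cout << "(1e40 - 1e40) + 1 = " << (big - big) + one << endl;  // prints 1
        cout << "(1e40 + 1) - 1e40 = " << (big + one) - big << endl;  // prints 0
        return 0;
    }

With terms spanning many orders of magnitude, a parallel reduction tree and a sequential loop will generally round differently, and neither order is inherently the more accurate one.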

1 Answer

I can only speculate about what went wrong, but I would assume it is underflow (or, more specifically, a difference in how CPUs and GPUs handle IEEE-754 denormalized numbers):

http://en.wikipedia.org/wiki/Denormal_number

Basically, CPUs handle them according to the IEEE-754 standard, albeit very inefficiently.

GPUs, on the other hand, generally equate them to 0. I do not know if there is a CUDA way to force CPUs to also flush denormalized numbers for development purposes (I mostly do OpenCL), but the C/C++ way is usually

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Or, in gcc, compile with -ffast-math.
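
To see the effect on the CPU side, here is a rough sketch (assuming an x86 CPU with SSE math and no -ffast-math; the intrinsic comes from xmmintrin.h):

    #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON
    #include <iostream>
    using namespace std;

    int main() {
        // 1e-38f * 1e-5f = 1e-43f, below FLT_MIN (~1.18e-38f), so the
        // product is representable only as a denormal.
        volatile float tiny = 1e-38f, scale = 1e-5f;

        cout << "default: " << tiny * scale << endl;  // a denormal, ~1e-43

        // Flush denormal results to zero, mimicking typical GPU behavior.
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        cout << "FTZ on:  " << tiny * scale << endl;  // 0
        return 0;
    }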

Check this SO question: Why does changing 0.1f to 0 slow down performance by 10x?

  • You can pass `-ftz=true` to `nvcc` to [flush denormalized values to zero](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-cuda-compilation), or else `-ftz=false` to not flush denormalized values to zero, for single precision numbers. In this case the question indicates double precision. – Robert Crovella Sep 24 '13 at 15:50
  • In addition to what Robert Crovella said, -ftz=true is the compiler default when compiling for platforms >= sm_20. It very much looks like a different summation order is responsible for the different results. To determine which result is more accurate (CPU or GPU) one would have to use a higher-precision reference (a compensated-summation sketch follows these comments). – njuffa Sep 24 '13 at 17:26
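
As njuffa's comment suggests, one way to build a higher-accuracy host-side reference is compensated (Kahan) summation, which carries the rounding error of each addition forward into the next one. A small sketch (plain C++, illustrative only):

    #include <iomanip>
    #include <iostream>
    #include <vector>
    using namespace std;

    // Kahan summation: c captures the low-order bits lost by each
    // addition and feeds them back into the next one.
    double kahan_sum(const vector<double>& v) {
        double sum = 0.0, c = 0.0;
        for (double x : v) {
            double y = x - c;      // re-inject the previous correction
            double t = sum + y;    // low-order bits of y may be lost here
            c = (t - sum) - y;     // algebraically 0; numerically the lost part
            sum = t;
        }
        return sum;
    }

    int main() {
        // One large term plus 10,000 tiny ones. Each 1e-16 is below half an
        // ulp of 1.0, so naive accumulation never moves off 1.0.
        vector<double> v(10001, 1e-16);
        v[0] = 1.0;                          // exact sum is 1 + 1e-12

        double naive = 0.0;
        for (double x : v) naive += x;

        cout << setprecision(17);
        cout << "naive: " << naive        << endl;   // 1
        cout << "kahan: " << kahan_sum(v) << endl;   // ~1.000000000001
        return 0;
    }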