I ran into a really weird problem. I summed all the elements of a matrix using thrust::reduce. It works for most data, but it goes wrong on one particular set.

code:

  lbfgsfloatval_t sum(const DeviceVector& A){
    thrust::device_ptr<lbfgsfloatval_t> ptr(A.getPtr());
    thrust::device_vector<double> A_p(ptr, ptr + A.rows() * A.cols());
    lbfgsfloatval_t sums = 0.0;

    // reduce on host
    for (int i = 0; i < A.rows() * A.cols(); i++)
        sums += A_p[i];
    // reduce on device
    lbfgsfloatval_t res = thrust::reduce(A_p.begin(), A_p.end());
    cout << "cpu: " << sums << endl;
    cout << "gpu: " << res  << endl;
    return res;
  }

Notice that the second group below is wrong.

output:

cpu: -568.691
gpu: -568.691

cpu: 3.4972e-14
gpu: 1.40998e-14

cpu: 0.234375
gpu: 0.234375

I also tried skipping the thrust::device_vector and using a raw pointer instead: same output. I also tried a cuBLAS dot product: same output.

I used MATLAB to confirm that the CPU result above is correct.

What happened? Was it underflow on the GPU? Thanks!

  • Just how wrong is the result? What is the maximum absolute value of an element in the second array? Maybe it's just caused by a different order of summation? – fjarri Sep 24 '13 at 12:35
  • The CPU result 3.4972e-14 is correct, as confirmed by MATLAB. The memory layout on the GPU and CPU is the same: column-major, unrolled as one long vector. So both sums run along the columns. – user2684645 Sep 24 '13 at 12:39
  • If `reduce` is implemented as I think it is, the order of summation depends on the order the blocks are scheduled to multiprocessors, which is random. MATLAB and your CPU code perform the summation sequentially. – fjarri Sep 24 '13 at 12:43
  • The order is random. Depending on the magnitudes of the values in your array this may or may not matter. E.g. 1e40 - 1e40 + 1 != 1e40 + 1 - 1e40 (see the sketch after these comments). – fjarri Sep 24 '13 at 12:53
  • I added -fmad=false but it still gave the same result. – user2684645 Sep 24 '13 at 12:53
  • @user2684645 Sorry, I misunderstood the question. You are making a reduction, while I was using FMADs, so I mixed up things. Of course, disabling FMAD has no effect since there is no FMAD in your code. Have a look at the 2nd answer of [Precision in Sum reduction kernel with floats](http://stackoverflow.com/questions/15139656/precision-in-sum-reduction-kernel-with-floats). It quotes the paper [A Comparison Of Methods For Accurate Summation](http://www.sigsam.org/bulletin/articles/147/sumnums.pdf) which points out how "the summation of large sets of numbers is prone to serious rounding errors". – Vitality Sep 24 '13 at 13:17
  • What GPU architecture (compute capability) are you compiling for and running on? – Robert Crovella Sep 24 '13 at 15:52
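
As fjarri's comments point out, floating-point addition is not associative, so the order in which a reduction combines terms changes the rounding. Here is a minimal host-side sketch of his 1e40 example (plain C++, illustrative only, not from the thread):

    #include <iostream>
    using namespace std;

    int main() {
        // Same three terms, two summation orders. In exact arithmetic both
        // sums equal 1, but in double precision adding 1 to 1e40 loses the
        // 1 entirely (it is far below one ulp of 1e40).
        double big = 1e40, one = 1.0;

        cout << "(1e40 - 1e40) + 1 = " << (big - big) + one << endl;  // prints 1
        cout << "(1e40 + 1) - 1e40 = " << (big + one) - big << endl;  // prints 0
        return 0;
    }

With terms spanning many orders of magnitude, a parallel reduction tree and a sequential loop will generally round differently, and neither order is inherently the more accurate one.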

1 Answer

I can only speculate about what went wrong, but I would assume it is underflow (or, more specifically, a difference in how CPUs and GPUs handle IEEE-754 denormalized numbers):

http://en.wikipedia.org/wiki/Denormal_number

Basically, CPUs handle them according to the IEEE-754 standard, albeit very inefficiently.

GPUs, on the other hand, generally equate them to 0. I do not know if there is a CUDA way to force CPUs to also flush denormalized numbers for development purposes (I mostly do OpenCL), but the C/C++ way is usually

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Or, in gcc, compile with -ffast-math.
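
To see the effect on the CPU side, here is a rough sketch (assuming an x86 CPU with SSE math and no -ffast-math; the intrinsic comes from xmmintrin.h):

    #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON
    #include <iostream>
    using namespace std;

    int main() {
        // 1e-38f * 1e-5f = 1e-43f, below FLT_MIN (~1.18e-38f), so the
        // product is representable only as a denormal.
        volatile float tiny = 1e-38f, scale = 1e-5f;

        cout << "default: " << tiny * scale << endl;  // a denormal, ~1e-43

        // Flush denormal results to zero, mimicking typical GPU behavior.
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        cout << "FTZ on:  " << tiny * scale << endl;  // 0
        return 0;
    }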

Check this SO question: Why does changing 0.1f to 0 slow down performance by 10x?

  • You can pass `-ftz=true` to `nvcc` to [flush denormalized values to zero](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-cuda-compilation), or else `-ftz=false` to not flush denormalized values to zero, for single precision numbers. In this case the question indicates double precision. – Robert Crovella Sep 24 '13 at 15:50
  • In addition to what Robert Crovella said, -ftz=true is the compiler default when compiling for platforms >= sm_20. It very much looks like a different summation order is responsible for the different results. To determine which result is more accurate (CPU or GPU) one would have to use a higher-precision reference (a compensated-summation sketch follows these comments). – njuffa Sep 24 '13 at 17:26
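
As njuffa's comment suggests, one way to build a higher-accuracy host-side reference is compensated (Kahan) summation, which carries the rounding error of each addition forward into the next one. A small sketch (plain C++, illustrative only):

    #include <iomanip>
    #include <iostream>
    #include <vector>
    using namespace std;

    // Kahan summation: c captures the low-order bits lost by each
    // addition and feeds them back into the next one.
    double kahan_sum(const vector<double>& v) {
        double sum = 0.0, c = 0.0;
        for (double x : v) {
            double y = x - c;      // re-inject the previous correction
            double t = sum + y;    // low-order bits of y may be lost here
            c = (t - sum) - y;     // algebraically 0; numerically the lost part
            sum = t;
        }
        return sum;
    }

    int main() {
        // One large term plus 10,000 tiny ones. Each 1e-16 is below half an
        // ulp of 1.0, so naive accumulation never moves off 1.0.
        vector<double> v(10001, 1e-16);
        v[0] = 1.0;                          // exact sum is 1 + 1e-12

        double naive = 0.0;
        for (double x : v) naive += x;

        cout << setprecision(17);
        cout << "naive: " << naive        << endl;   // 1
        cout << "kahan: " << kahan_sum(v) << endl;   // ~1.000000000001
        return 0;
    }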