
I want to implement an algorithm on the GPU using CUDA. At the same time, I am writing a CPU version in C++ to verify the GPU results. However, I ran into trouble when using log() on the CPU and the GPU. A very simple piece of the algorithm (used on both CPU and GPU) is shown below:

float U;
float R = U * log(U);

However, when I compare the CPU and GPU results, I find that many of them (459883 out of 1843161) have small differences (the maximum difference is 0.5). Some results are shown below:

| U       | R (CPU side)  | R (GPU side)  | R using Python (`U * math.log(U)`) |
|---------|---------------|---------------|------------------------------------|
| 86312.0 | 980998.375000 | 980998.3125   | 980998.3627440572                  |
| 67405.0 | 749440.750000 | 749440.812500 | 749440.7721980268                  |
| 49652.0 | 536876.875000 | 536876.812500 | 536876.8452369706                  |
| 32261.0 | 334921.250000 | 334921.281250 | 334921.2605240216                  |
| 24232.0 | 244632.437500 | 244632.453125 | 244632.4440747978                  |

Can anybody give me some suggestions? Which one should I trust?

  • If you want better precision, use `double` instead of `float`. `float` holds about 7 decimal digits, so it is perfectly normal to see differences in the sixth or seventh digit on different architectures. – aland Jan 02 '14 at 16:58
  • Potentially relevant: http://www.herikstad.net/2009/05/cuda-and-double-precision-floating.html; CUDA floats are not the same as IEEE 754 floats. – dyoo Jan 02 '14 at 16:58
  • You should trust neither the CPU nor the GPU results, because both are obtained with finite-precision computer arithmetic. But CUDA results are usually less accurate, because such functions have simpler and faster implementations. – Constructor Jan 02 '14 at 16:59
  • While I love GPUs, I'd be inclined to trust the CPU without more investigation. NVIDIA publishes the accuracy of its CUDA mathematical functions in the [CUDA programming guide](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#mathematical-functions-appendix). – radical7 Jan 02 '14 at 16:59
  • @dyoo Actually, modern (starting with `sm_20`) GPUs [do conform to IEEE 754](http://stackoverflow.com/a/10335070/929437). – aland Jan 02 '14 at 17:02
  • BTW, Python only supports double-precision floating point. – geek Jan 02 '14 at 17:20
  • @aland That's for the six basic operations: +, -, *, /, sqrt, and fma. The logarithm is implemented in software, so its accuracy is whatever accuracy it was designed for. The kind of non-vector implementations of such functions that run on CPUs have the opportunity to make the result accurate enough by using different execution paths for different inputs, whereas the sort of vector implementations that run on GPUs typically use a fixed execution path, even if that makes them a bit worse for some inputs. – Pascal Cuoq Jan 02 '14 at 17:28
  • @aland “Different execution paths” includes a different number of iterations for Newton or Ziv methods, and different polynomial approximations and/or arithmetic tricks, as in http://www.netlib.org/fdlibm/k_cos.c – Pascal Cuoq Jan 02 '14 at 17:36
  • Your GPU and CPU results agree to 6-7 decimal digits when using `float`. That's pretty close to the precision expected from `float` (about 7.22 decimal digits). But CPU results can also differ among themselves, e.g. the x87 FPU internally uses 80-bit extended precision for all operations (including operations on `float`s), while SSE and AVX use 32- or 64-bit internal precision. Therefore 32-bit x86 code could give different results than 64-bit x86 code. – Hristo Iliev Jan 02 '14 at 22:52
  • My commentary on comparing CPU and GPU floating-point results can be found here: http://stackoverflow.com/a/13948228/655457 – ArchaeaSoftware Jan 03 '14 at 02:49

2 Answers


> Which one should I trust?

You should trust the double-precision result computed by Python, which you could also have obtained with CUDA or C++ in double precision to get very similar (although likely still not identical) values.

To rephrase the first comment made by aland, if you care about an error of 0.0625 in 980998, you shouldn't be using single precision in the first place. Both the CPU and the GPU result are “wrong” for that level of accuracy. In your examples, the CPU result happens to be more accurate, but you can see that both single-precision results are quite distant from the more accurate double-precision Python result. This is simply a consequence of using a format that allows 24 significant binary digits (about 7 decimal digits), not just for the input and the end result, but also for intermediate computations.
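For illustration (a small sketch, not part of the original answer): the spacing between adjacent `float` values near 980998 can be checked directly in C++, and it is exactly the 0.0625 mentioned above.

```cpp
#include <cmath>
#include <cstdio>

int main()
{
    // One ulp of a float near 980998 is 2^-4 = 0.0625, the same size as the
    // CPU/GPU discrepancy in the first row of the table in the question.
    float x = 980998.0f;
    float ulp = nextafterf(x, 2.0f * x) - x;  // distance to the next float up
    printf("ulp near %g is %g\n", x, ulp);    // prints: ulp near 980998 is 0.0625
    return 0;
}
```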

If the input is provided as float and you want the most accurate float result for R, compute U * log(U) using double and round to float only at the end. Then the results will almost always be identical between the CPU and the GPU.
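A minimal sketch of that suggestion (not from the original answer; the `HD` macro and the name `u_log_u` are just placeholders so the same code builds as plain C++ or, with nvcc, as device code):

```cpp
#include <cmath>

#ifdef __CUDACC__
#define HD __host__ __device__   // compiled by nvcc: usable on both CPU and GPU
#else
#define HD                        // plain C++ build
#endif

// Promote to double, take the log and multiply in double precision,
// and round back to float only once at the very end.
HD inline float u_log_u(float U)
{
    double u = (double)U;    // exact: every float value is representable as a double
    double r = u * log(u);   // double-precision log and multiply
    return (float)r;         // single, final rounding to float
}
```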

Pascal Cuoq
  • 79,187
  • 7
  • 161
  • 281
  • It's a very good answer, but there is another solution: try the other way around and force single-precision float evaluation on the CPU side, as in my answer; that might lead to faster but more fragile code (in the sense that two different implementations of log might well give two different results). – aka.nice Jan 03 '14 at 00:58
  • Thanks a lot! I followed your suggestion to initialize U as double and compute R as float, and there are no longer any differing results! – oilpig Jan 03 '14 at 01:09
  • In general, one should not expect bit-identical results across any two platforms. I suggest reading the following white paper, as well as the references it cites: http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf – njuffa Jan 03 '14 at 01:30
  • @njuffa Allowing bit-identical results across platforms is the best argument I can think of for correctly rounded elementary functions, although one must admit that the current implementations do not lend themselves to vectorization and that most users do not care enough to pay the price in terms of large tables of coefficients or occasional time penalty. – Pascal Cuoq Jan 03 '14 at 11:14
  • Correctly-rounded transcendental functions are hard, and expensive. Most users do not want to pay for such a feature. Plus it still does not guarantee bit-accurate results across IEEE-754 compliant platforms, as there are other issues to worry about, e.g. double-rounding on Linux platforms using x87. – njuffa Jan 03 '14 at 19:21

Out of curiosity, I compared the position of the last bit set in the significand (in other words, the number of trailing zeros in the significand).
I did it with Squeak Smalltalk because I'm more comfortable with it, but I'm pretty sure you can find equivalent libraries in Python:

CPU:

#(980998.375000 749440.750000 536876.875000 334921.250000 244632.437500)
    collect: [:e | e asTrueFraction numerator highBit].

-> #(23 22 23 21 22)

GPU:

#(980998.3125 749440.812500 536876.812500 334921.281250 244632.453125)
    collect: [:e | e asTrueFraction numerator highBit].

-> #(24 24 24 24 24)

Interestingly, that's not as random as one could expect, especially for the GPU, but there are not enough clues at this stage...
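For readers without Smalltalk, a rough C++ equivalent of the same check might look like this (a sketch assuming IEEE 754 binary32 and normal, non-zero inputs; the function name is only illustrative):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Highest set bit of the odd part of the significand, i.e. 24 minus the
// number of trailing zero bits: the same quantity the Smalltalk snippets
// compute with "asTrueFraction numerator highBit".
int significandHighBit(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);             // view the float's bits
    uint32_t mant = (bits & 0x7FFFFFu) | 0x800000u;  // 24-bit significand with implicit 1
    int trailingZeros = 0;
    while ((mant & 1u) == 0u) { mant >>= 1; ++trailingZeros; }
    return 24 - trailingZeros;
}

int main()
{
    const float cpu[] = {980998.375f,  749440.75f,    536876.875f,  334921.25f,    244632.4375f};
    const float gpu[] = {980998.3125f, 749440.8125f,  536876.8125f, 334921.28125f, 244632.453125f};
    for (float v : cpu) printf("%d ", significandHighBit(v));   // 23 22 23 21 22
    printf("\n");
    for (float v : gpu) printf("%d ", significandHighBit(v));   // 24 24 24 24 24
    printf("\n");
    return 0;
}
```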

Then I used the ArbitraryPrecisionFloat package to perform (emulate) the operations in extended precision and round to the nearest single-precision float; the correct answer matches the CPU one quite exactly:

#( 86312 67405 49652 32261 24232 ) collect: [:e |
    | u r |
    u := e asArbitraryPrecisionFloatNumBits: 80.
    r := u * u ln.
    (r asArbitraryPrecisionFloatNumBits: 24) asTrueFraction printShowingMaxDecimalPlaces: 100]

-> #('980998.375' '749440.75' '536876.875' '334921.25' '244632.4375')

It works as well with 64 bits.

But if I emulate the operations in single precision, then I can say the GPU matches the emulated results quite well too (except the second item):

#( 86312 67405 49652 32261 24232 ) collect: [:e |
    | u r | 
    u := e asArbitraryPrecisionFloatNumBits: 24.
    r := u * u ln.
    r asTrueFraction printShowingMaxDecimalPlaces: 100]

-> #('980998.3125' '749440.75' '536876.8125' '334921.28125' '244632.453125')

So I'd say the CPU probably used double (or extended) precision to evaluate the log and perform the multiplication.

On the other hand, the GPU performed all the operations in single precision. The log function of the ArbitraryPrecisionFloat package is correct to half an ulp, but that is not a requirement of IEEE 754, which can explain the mismatch observed on the second item.

You may try to write the CPU code so as to force float evaluation (like using logf instead of log if it's C99, or using intermediate float results: float ln = log(u); float r = u * ln;) and possibly use appropriate compilation flags to forbid extended precision (I can't remember which; I don't use C every day). But then you have very little guarantee of obtaining a 100% match on the log function; the standards are too lax.
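A small sketch of that idea on the CPU side (not from the original answer; the helper name is a placeholder, and on x86 a flag like GCC's `-mfpmath=sse` can be used to avoid x87 extended precision, though, as noted above, an exact match is still not guaranteed):

```cpp
#include <cmath>
#include <cstdio>

// Keep every intermediate result in single precision on the CPU side,
// mirroring what the GPU does on its float path.
float uLogUSingle(float u)
{
    float ln = logf(u);   // single-precision log (C99/C++11 logf)
    float r  = u * ln;    // single-precision multiply
    return r;
}

int main()
{
    const float inputs[] = {86312.0f, 67405.0f, 49652.0f, 32261.0f, 24232.0f};
    for (float u : inputs)
        printf("%f\n", uLogUSingle(u));
    return 0;
}
```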

aka.nice