When I divide two floating point numbers on the GPU, the result is 0.196405. When I divide them on the CPU, the result is 0.196404. The actual value, computed with a calculator, is 0.196404675. How do I make the division on the GPU and the CPU produce the same result?

- Why do you need them to be the same? My gut says that if you need them to be equivalent, you should adjust your significant digits when interpreting the results, not when calculating them. – Patrick87 Dec 18 '12 at 16:38
- What precision, what GPU, what source numbers and most importantly why? – talonmies Dec 18 '12 at 16:47
- Please show code and compiler options and identify the GPU. Double-precision division in CUDA always uses IEEE-754 rounding, however the CPU may use extended precision internally, leading to a problem called double rounding when it returns the double precision result. Single-precision division in CUDA uses IEEE-754 rounding by default for sm_20 and up. Various compiler options can lead to the use of approximate single-precision division, and sm_1x platforms always use approximate division for the single-precision division operator (you can use intrinsics to get IEEE-754 rounded division). – njuffa Dec 18 '12 at 18:59
- @talonmies: I am doing floating point division. The GPU is a GeForce GT 540M. As for the why, I know my CPU implementation is correct. I just want to check if my GPU implementation is correct by comparing outputs. – Programmer Dec 19 '12 at 09:43
- @Programmer: single or double precision is what I was asking. – talonmies Dec 19 '12 at 09:47
- The GT 540M is a compute capability 2.1 part. If you are compiling for this architecture with -arch=sm_20, both the single and double precision division operator '/' will map to divisions that are rounded according to IEEE-754 (round-to-nearest-or-even) by default. If you use the compiler flags -prec-div=false or -use_fast_math (which implies -prec-div=false), approximate single precision divisions and reciprocals will be generated instead. When using those flags you could still get properly rounded division by invoking the intrinsic __fdiv_rn(). – njuffa Dec 19 '12 at 18:37 (see the sketch after these comments)
- @Programmer If you're just using the CPU implementation to check that the GPU results are correct by comparison, then you should probably follow the advice that I gave in my answer below and just use a tolerance. Rather than checking if `a == b`, check if `(a - b < t) && (a - b > -t)` for some positive tolerance `t`. Or, if you prefer the tolerance to be a percentage (so that it scales with the magnitude of the numbers), you can check if `|a - b| < |a| * t`, where `t` is a fraction (e.g. 0.01 for 1% tolerance, 0.0001 for 0.01%, etc.). – reirab Dec 20 '12 at 06:01
- http://docs.nvidia.com/cuda/floating-point/index.html – JongHyeon Yeo Feb 22 '18 at 01:51
- Recently asked relevant question: [If I copy a float to another variable, will they be equal?](https://stackoverflow.com/questions/59710531/if-i-copy-a-float-to-another-variable-will-they-be-equal) – OfirD Jan 24 '20 at 11:24
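A minimal sketch of the points raised in njuffa's comments above (compute capability 2.x, default -prec-div=true, and the __fdiv_rn() intrinsic when fast-math flags are in use); the kernel name and the operand values are purely illustrative:

```cpp
#include <cstdio>

// Single-precision divide on the device. With -arch=sm_20 and default compiler
// flags the '/' operator is already IEEE-754 rounded (round-to-nearest-even);
// __fdiv_rn() requests that rounding explicitly, even when -prec-div=false or
// -use_fast_math would otherwise generate the approximate division.
__global__ void divide(float a, float b, float *q)
{
    *q = __fdiv_rn(a, b);   // or simply: *q = a / b;
}

int main()
{
    const float a = 0.3356f, b = 1.7087f;   // illustrative operands
    float gpu = 0.0f, *dq;
    cudaMalloc(&dq, sizeof(float));
    divide<<<1, 1>>>(a, b, dq);
    cudaMemcpy(&gpu, dq, sizeof(float), cudaMemcpyDeviceToHost);
    printf("GPU: %.9f  CPU: %.9f\n", gpu, a / b);
    cudaFree(dq);
    return 0;
}
```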
3 Answers
As the comments to another answer suggest, there are many reasons why it is not realistic to expect the same results from floating point computations run on the CPU and GPU. It's much stronger than that: you can't assume that FP results will be the same when the same source code is compiled against a different target architecture (e.g. x86 or x64) or with different optimization levels, either.
In fact, if your code is multithreaded and the FP operations are performed in different orders from one run to the next, then the EXACT SAME EXECUTABLE running on the EXACT SAME SYSTEM may produce slightly different results from one run to the next.
Some of the reasons include, but are not limited to:
- floating point operations are not associative, so seemingly-benign reorderings (such as the race conditions from multithreading mentioned above) can change results (a small example follows this list);
- different architectures support different levels of precision and rounding under different conditions (e.g. compiler flags, control word versus per instruction);
- different compilers interpret the language standards differently, and
- some architectures support FMAD (fused multiply-add) and some do not.
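As a small illustration of the non-associativity point above (a sketch; the constants are chosen only to make the rounding effect visible in single precision):

```cpp
#include <cstdio>

int main()
{
    // Single-precision addition is not associative: grouping changes rounding.
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   // (0.0f) + 1.0f            -> 1.0f
    float right = a + (b + c);   // (b + c) rounds to -1.0e8f -> 0.0f
    printf("left = %g, right = %g\n", left, right);
    return 0;
}
```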
Note that for purposes of this discussion, the JIT compilers for CUDA (the magic that enables PTX code to be future-proof to GPU architectures that are not yet available) certainly should be expected to perturb FP results.
You have to write FP code that is robust despite the foregoing.
As I write this today, I believe that CUDA GPUs have a much better-designed architecture for floating point arithmetic than any contemporary CPU. GPUs include native IEEE standard (c. 2008) support for 16-bit floats and FMAD, have full-speed support for denormals, and enable rounding control on a per-instruction basis rather than control words whose settings have side effects on all FP instructions and are expensive to change.
In contrast, CPUs have an excess of per-thread state and poor performance except when using SIMD instructions, which mainstream compilers are terrible at exploiting for performance (since vectorizing scalar C code to take advantage of such instruction sets is much more difficult than building a compiler for a pseudo-scalar architecture such as CUDA). And if the Wikipedia history page is to be believed, Intel and AMD appear to have completely botched the addition of FMAD support in a way that defies description.
You can find an excellent discussion of floating point precision and IEEE support in NVIDIA GPUs here: http://docs.nvidia.com/cuda/floating-point/index.html

- I disagree with some of the criticism of CPU architectures. Scalar SSE is plenty fast compared with scalar integer arithmetic, due to the impressive CPU pipeline. 16-bit floats are only supported on high-end GPUs; they remain a niche use case. Personal tests with FMA on CPUs showed <5% performance increase, so it doesn't matter that much if compilers don't use them. That being said, the difficulty of vectorization on CPUs is real. This is where GPUs specialize; CPUs can't compete. Witness the failure of Intel MIC. – Benjamin Jun 29 '18 at 18:06
- SSE was introduced in 1998. Maybe you are referring to AVX or AVX-512? Doesn't ISA-level exposure of the SIMD width strike you as architecturally inferior to CUDA's SIMT model? Which FMA extension are you referring to? The 3- or 4-operand one, introduced by AMD or Intel? 16-bit float conversion has been available on GPUs since before CUDA came out. The first Intel CPUs to support conversion (F16C extension) began volume manufacture in late 2011, 5 years after the first CUDA-capable GPU. FP16 arithmetic came much later, but GPUs also lead CPUs there. – ArchaeaSoftware Jul 12 '18 at 19:33
- "GPUs... have full-speed support for denormals" is incorrect. That's the case (since Fermi) for FMA, but not the case e.g. for div, sqrt or rsqrt. – ZachB Jul 22 '20 at 21:49
- It would be nice to see a citation to support this claim. In a post contrasting CPUs and GPUs, it's important to remember that CPUs revert to *microcode*, at a performance cost of several orders of magnitude (say, 100x), if a denormal is encountered either as input or as output. GPUs do not incur a similar data-dependent performance penalty. – ArchaeaSoftware Sep 25 '20 at 19:14
You don't. You should never assume that floating point values will be exactly equal to what you expect after mathematical operations. They are only defined to be correct to a specified precision and will vary slightly from processor to processor, regardless of whether that processor is a CPU or a GPU. An x86 processor, for instance, will actually do floating point computations with 80 bits of precision by default and will then truncate the result to the requested precision.

Equivalence comparisons for floating point numbers should always use a tolerance, since no guarantee can be made that any two processors (or even the same processor, through different but mathematically equivalent sequences of instructions) will produce the same result. E.g. floating-point numbers `a` and `b` should be considered equal if and only if `|a - b| < t` for some tolerance `t`.
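A minimal sketch of such a tolerance comparison (the helper name `nearly_equal` and the tolerance value are illustrative; choose a tolerance appropriate to the precision your computation actually needs):

```cpp
#include <cstdio>
#include <cmath>

// Returns true if a and b agree within an absolute tolerance t.
// Marked __host__ __device__ so the same check can be used on the CPU and the GPU.
__host__ __device__ inline bool nearly_equal(float a, float b, float t)
{
    return fabsf(a - b) < t;
}

int main()
{
    float gpu_result = 0.196405f;   // values from the question
    float cpu_result = 0.196404f;
    printf("%s\n", nearly_equal(gpu_result, cpu_result, 1e-5f) ? "equal" : "not equal");
    return 0;
}
```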

- No argument with your answer, however as a point of fact, it's been a while since 80-bit was the norm for x86 CPUs. Most modern performance-oriented x86 FP code [does not use 80 bits](http://stackoverflow.com/questions/3206101/extended-80-bit-double-floating-point-in-x87-not-sse2-we-dont-miss-it). – Robert Crovella Dec 18 '12 at 19:10
- `They are only defined to be correct to a specified precision and will vary slightly from processor to processor, regardless of whether that processor is a CPU or a GPU.` I don't think this is correct. As far as I understand, IEEE 754 defines the exact bit sequence result of all supported operations. Here's a good page about [floating point determinism](http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/). Scroll down to see comments from people that have implemented systems that depend on exact reproducibility across platforms. – Roger Dahl Dec 18 '12 at 20:36
- @Roger Dahl Yes, I was aware of the IEEE standard, but not every processor actually follows it perfectly. The bit layout is pretty universal, but I've seen some processors that didn't follow the rounding specifications exactly. Even if the process is entirely deterministic though, it's still only defined to be correct to the specified precision. As such, performing a different but mathematically equivalent set of instructions is not defined to produce equal results, thus equality comparisons should use a tolerance. – reirab Dec 18 '12 at 20:54
- @RobertCrovella Yeah, good point about SSE usually being used now for performance intensive code (and the SSE registers obviously don't contain 80 bits per value.) I was just using the old x86 FPU as an example of a common FPU that doesn't follow the IEEE standard exactly. If you don't use SSE, though, 80-bit is still the default, AFAIK. – reirab Dec 18 '12 at 21:02
- It's more complicated than that even if you stick to the x87 FPU. While the x87 FPU has 80-bit registers, mantissa precision can be set through the PC (precision control) field in the CW (control word). Traditionally the PC was set to extended precision for Linux platforms and double precision for Windows platforms, leading to all kinds of interesting mismatches. Even SSE can have surprises in store: one compiler I use sets FTZ (flush-to-zero) mode for single precision SSE computations by default at -O1 or higher. – njuffa Dec 19 '12 at 07:50
Which GPU is used for the computation?
Normally there will be a precision error of +1/-1 in the sixth place of the mantissa if you are using single-precision floating-point operations. This is because of rounding error on the GPU.
If you use double precision, you will get the same precision that you get on the CPU, but the speed will be almost half that of single precision and the memory usage will be twice as much. NVIDIA GPUs support double-precision computation from the Fermi architecture onwards.
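A small sketch comparing single- and double-precision division on the device with the same divisions on the CPU (the operand values are illustrative; compile with -arch=sm_20 or higher so a Fermi part such as the GT 540M uses native double precision):

```cpp
#include <cstdio>

// Performs one division in single precision and one in double precision on the device.
__global__ void divide_both(float fa, float fb, double da, double db,
                            float *fq, double *dq)
{
    *fq = fa / fb;   // IEEE-754 rounded single-precision divide (sm_20+, default flags)
    *dq = da / db;   // IEEE-754 rounded double-precision divide
}

int main()
{
    float  *dfq; double *ddq;
    float  hfq;  double hdq;
    cudaMalloc(&dfq, sizeof(float));
    cudaMalloc(&ddq, sizeof(double));

    divide_both<<<1, 1>>>(0.3356f, 1.7087f, 0.3356, 1.7087, dfq, ddq);
    cudaMemcpy(&hfq, dfq, sizeof(float),  cudaMemcpyDeviceToHost);
    cudaMemcpy(&hdq, ddq, sizeof(double), cudaMemcpyDeviceToHost);

    printf("GPU float  : %.9f    CPU float  : %.9f\n", hfq, 0.3356f / 1.7087f);
    printf("GPU double : %.15f   CPU double : %.15f\n", hdq, 0.3356 / 1.7087);

    cudaFree(dfq);
    cudaFree(ddq);
    return 0;
}
```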

- I am using a GeForce GT 540M. As far as I know, it does not support double. Thus, using float is the only choice. – Programmer Dec 19 '12 at 09:42
- @Programmer: That GPU is a GF108 based part, which means compute capability 2.1 and double precision support. – talonmies Dec 19 '12 at 09:53
- [link](http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units) The GT 540M is GF108, which is the Fermi architecture. – Sijo Dec 19 '12 at 09:55