This is confusing since the size of the significand is not much higher, and I did not observe such a difference when switching from double to long double.
Take a simple example: use a 12-digit pocket calculator to add two 8-digit numbers, and then to add two 11-digit numbers. Do you see any difference in speed? Now use that calculator to add two 23-digit numbers. Which one do you think will be slower? Obviously the last one needs a lot more operations (and also more space, since you need to write intermediate results on paper).
On x86 you have hardware support for IEEE-754 single and double precision and for the 80-bit extended precision long double, so operations on those types are done completely in hardware, typically with just a single instruction. double + double is no different from long double + long double, which is the same FADD instruction in x87. If you use SSE then double will be a bit faster than long double due to the use of the newer SIMD registers and instructions.
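To see this for yourself, a minimal sketch like the following can be compiled to assembly (for example with `gcc -O2 -S`). The function names are made up for illustration. On a typical x86-64 build you would expect the first to become an SSE2 scalar add and the second an x87 add, but the exact output depends on the compiler, flags and target.

```c
/* Hypothetical helper names, used only for illustration.
   Each function is a single floating-point addition in the source. */

/* Usually compiled to one SSE2 scalar add on x86-64. */
double add_double(double a, double b)
{
    return a + b;
}

/* Usually compiled to one x87 add (plus loads of the operands) on x86. */
long double add_long_double(long double a, long double b)
{
    return a + b;
}
```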
When you use __float128, however, the compiler needs to use software emulation, which is far slower. You can't add two __float128 values with one or two instructions. You need to do everything manually:
- Separate the sign, exponent and significand components of each operand (at least ~3 instructions). The significand must be stored in multiple registers because you don't have a single integer register that big
- Align the radix point position for the 2 values, which needs many shift and mask operations (again because the significand is stored in multiple registers)
- Add the 2 significands, which needs 2 instructions on a 64-bit platform
- Normalize the result, which needs to check the sum for overflow/underflow conditions, find the most significant bit position, calculate the exponent...
- Combine the result's sign, exponent and significand
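The steps above can be sketched in C. This is only an illustration under heavy assumptions, not what a real library routine such as libgcc's `__addtf3` does: it works on 32-bit float so the significand fits in one register, it only handles positive normal inputs, it truncates instead of rounding, and it ignores zero, subnormals, infinities, NaN and exponent overflow. The name `soft_add_positive` is made up. For __float128 every significand operation below would have to be done on two 64-bit words.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: adds two positive, normal floats with integer
   operations only. Truncates instead of rounding and does not handle
   signs, zero, subnormals, infinities, NaN or exponent overflow. */
float soft_add_positive(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);

    /* 1. Separate exponent and significand (sign assumed positive);
          restore the implicit leading 1 bit. */
    int32_t  ea = (int32_t)((ua >> 23) & 0xFF);
    int32_t  eb = (int32_t)((ub >> 23) & 0xFF);
    uint32_t ma = (ua & 0x7FFFFF) | 0x800000;
    uint32_t mb = (ub & 0x7FFFFF) | 0x800000;

    /* 2. Align the radix points: shift the operand with the smaller
          exponent to the right. */
    if (ea < eb) {
        int32_t  te = ea; ea = eb; eb = te;
        uint32_t tm = ma; ma = mb; mb = tm;
    }
    uint32_t shift = (uint32_t)(ea - eb);
    mb = (shift < 24) ? (mb >> shift) : 0;

    /* 3. Add the significands (the sum has at most 25 bits). */
    uint32_t m = ma + mb;
    int32_t  e = ea;

    /* 4. Normalize: a carry out of bit 23 means shifting right by one
          and bumping the exponent. */
    if (m & 0x1000000) {
        m >>= 1;
        e += 1;
    }

    /* 5. Combine exponent and significand (drop the implicit bit). */
    uint32_t ur = ((uint32_t)e << 23) | (m & 0x7FFFFF);
    float r;
    memcpy(&r, &ur, sizeof r);
    return r;
}
```

Even this stripped-down version has two data-dependent branches and well over a dozen integer operations, compared with a single instruction for a hardware add.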
Those steps include several branches (which may be mispredicted), memory loads/stores (because x86 doesn't have many registers) and many more things that add up to at least tens of instructions. Doing such a complex task only 10 times slower is already a great achievement. And we haven't even gotten to multiplication yet, which is about 4 times as expensive when the significand width is doubled. Division, square root, exponentiation, trigonometry... are far more complicated and will be significantly slower.
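The "4 times" figure for multiplication comes from schoolbook multiplication: when an operand no longer fits in one machine word, it is split into two halves and all four pairs of halves must be multiplied and summed. The sketch below shows the idea one level down, building a 64x64 -> 128-bit product from four 32x32 -> 64-bit products. The function name `mul_64x64` is made up, and a real soft-float routine would apply the same pattern to 64-bit halves of a 113-bit significand.

```c
#include <stdint.h>

/* Hypothetical helper: full 128-bit product of two 64-bit integers,
   computed from four 32x32 -> 64-bit partial products. */
void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    /* Doubling the operand width means four multiplications, not two. */
    uint64_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t p1 = a_lo * b_hi;   /* weight 2^32 */
    uint64_t p2 = a_hi * b_lo;   /* weight 2^32 */
    uint64_t p3 = a_hi * b_hi;   /* weight 2^64 */

    /* Sum the middle column; it cannot overflow 64 bits because each
       of the three terms is below 2^32. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

On top of the four multiplications there are the extra additions and shifts needed to propagate carries between the partial products, which is part of why the real cost grows at least as fast as the operand width squared.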