
I program on Unix, using the g++ 4.8.2 compiler. I currently need to convert my C++ program, which at this point uses long double (with a 64-bit significand in my case), to a program that uses the __float128 type (with a 113-bit significand). I used the libquadmath0 package and the Boost library to do that, but the resulting program is 10 to 20 times slower than with long double.

This is confusing since the size of the significand is not much higher, and I did not observe such a difference when switching from double to long double. Is this timing difference normal, and if not, how can I fix it?

The code:

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <math.h>
#include <complex.h>
extern "C" {
#include <quadmath.h>
}
#include <gmp.h>
#include <iomanip>
#include <cfloat>
#include <boost/multiprecision/float128.hpp>


using namespace boost::multiprecision;
using namespace std;

typedef __float128 long_double_t;

int main()
{
...
}

The compiling instructions:

g++ --std=c++11 main.cc -o main -lgmp -lquadmath -Ofast -m64
phuclv
Thomas Prest
  • Perhaps one type is native to the hardware and the other is handled in software. – Captain Obvlious Oct 15 '14 at 00:27
  • OK, sure, float128 is emulated and long double is not, but does that alone explain this huge difference? – Thomas Prest Oct 15 '14 at 00:31
  • Out of interest, what do you need such high precision for? If you're doing something that involves multiplying many small numbers (e.g. probabilities), you might find that taking logarithms is all you need. – j_random_hacker Oct 15 '14 at 00:31
  • Yes, that explains the huge difference. Pretend you're a professional and profile your code to determine the bottleneck. – Captain Obvlious Oct 15 '14 at 00:32
  • Well, I am implementing an algorithm supposed to output a distribution statistically close to a discrete Gaussian, and in order for the statistical distance to be small, I need to know the parameters of the discrete Gaussian up to a high precision. It's not only about probabilities, unfortunately – Thomas Prest Oct 15 '14 at 00:34
  • Long before you switched to 128-bit floats, I assume you used interval mathematics (or other error-tracking patterns) to bound the error and determined it was too large? – Yakk - Adam Nevraumont Oct 15 '14 at 02:47
  • I did; the error is actually quite small, but not small enough. I need 128-bit floats to ensure that the statistical distance between the output distribution of my algorithm and the output distribution of a "perfect" algorithm (with infinite precision) is very small (less than 2^(-128), actually) – Thomas Prest Oct 15 '14 at 21:42

1 Answer


This is confusing since the size of the significand is not much higher, and I did not observe such a difference when switching from double to long double

Take a simple example: use a 12-digit pocket calculator to add two 8-digit numbers and then two 11-digit numbers. Do you see the difference? Now use that calculator to add two 23-digit numbers: which one do you think will be slower? Obviously the last one needs many more operations (and also more space, as you have to write intermediate results down on paper).

On x86 you have hardware support for IEEE-754 single, double and 80-bit extended-precision long double, so operations on those types are done completely in hardware, typically in a single instruction. double + double is no different from long double + long double: both map to the same FADD instruction in x87. If you use SSE, then double will be a bit faster than long double thanks to the newer SIMD registers and instructions.

When you use __float128, however, the compiler has to fall back to software emulation, which is far slower. You can't add two __float128 values with a single instruction. You have to do everything manually:

  • Extract the sign, exponent and significand components (at least ~3 instructions). The significand must be stored across multiple registers because no single integer register is wide enough
  • Align the radix points of the 2 values, which needs many shift and mask operations (again because the significand spans multiple registers)
  • Add the 2 significands, which needs 2 instructions (an add plus an add-with-carry) on a 64-bit platform
  • Normalize the result: check the sum for overflow/underflow conditions, find the most significant bit position, recalculate the exponent...
  • Combine the result's sign, exponent and significand

Those steps involve several branches (which may be mispredicted), memory loads/stores (because x86 doesn't have many registers) and much more, adding up to at least tens of instructions. Getting all of that done at only ~10 times the cost of a hardware add is already a great achievement. And we haven't even reached multiplication, which is roughly 4 times as much work when the significand width doubles. Division, square root, exponentiation, trigonometry... are far more complicated still and will be significantly slower

phuclv