Here is an eminently surprising fact about floating point: single-precision (`float`) arithmetic is not necessarily faster than double precision.
How can this be? Floating-point arithmetic is hard, so doing it with twice the precision is at least twice as hard and must take longer, right?
Well, no. Yes, it's more work to compute with higher precision, but as long as the work is being done by dedicated hardware (by some kind of floating-point unit, or FPU), everything is probably happening in parallel. Double precision may be twice as hard, and there may therefore be twice as many transistors devoted to it, but it doesn't take any longer.
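You can see this for yourself with a minimal, unscientific timing sketch like the one below (my own illustration; the loop count is arbitrary). On a typical CPU with a hardware FPU, compiled without aggressive optimization, the two timings usually come out roughly the same. (With auto-vectorization enabled, `float` can actually pull ahead, because twice as many single-precision values fit in a SIMD register, but that's a separate effect from the scalar FPU arithmetic we're talking about here.)

```c
/*
 * A minimal, unscientific timing sketch (my own illustration; the loop
 * count is arbitrary).  It runs the same accumulation once in float and
 * once in double, and reports the elapsed CPU time for each.
 */
#include <stdio.h>
#include <time.h>

#define N 100000000L

static float sum_float(void)
{
    float s = 0.0f;
    for (long i = 0; i < N; i++)
        s += (float)i * 0.5f;
    return s;
}

static double sum_double(void)
{
    double s = 0.0;
    for (long i = 0; i < N; i++)
        s += (double)i * 0.5;
    return s;
}

int main(void)
{
    clock_t t0, t1;
    volatile float fs;    /* volatile keeps the compiler from discarding the result */
    volatile double ds;

    t0 = clock();
    fs = sum_float();
    t1 = clock();
    printf("float:  %g  (%.2f s)\n", (double)fs, (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    ds = sum_double();
    t1 = clock();
    printf("double: %g  (%.2f s)\n", ds, (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}
```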
In fact, if you're on a system with an FPU that supports both single- and double-precision floating point, a good rule is: always use `double`. The reason for this rule is that type `float` is often inadequately accurate. So if you always use `double`, you'll quite often avoid numerical inaccuracies (ones that would kill you if you used `float`), but it won't be any slower.
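Here is a small illustration of the kind of inaccuracy I mean (my own example, with illustrative numbers): `float` carries only about 7 significant decimal digits, so it can't even distinguish 16777216 from 16777217, and rounding errors pile up quickly when you accumulate many small values. `double`, with roughly 16 digits, handles both cases comfortably.

```c
/*
 * Two quick demonstrations (illustrative numbers) of float's limited
 * precision: it can't distinguish 2^24 from 2^24 + 1, and rounding
 * error accumulates visibly when summing many small values.
 */
#include <stdio.h>

int main(void)
{
    float  f = 16777216.0f;    /* 2^24: beyond this, float can't represent every integer */
    double d = 16777216.0;

    printf("float:  16777216 + 1 = %.1f\n", f + 1.0f);  /* prints 16777216.0 */
    printf("double: 16777216 + 1 = %.1f\n", d + 1.0);   /* prints 16777217.0 */

    /* Add 0.01 ten million times; the exact answer is 100000. */
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 10000000; i++) {
        fsum += 0.01f;
        dsum += 0.01;
    }
    printf("float  sum: %f\n", fsum);   /* noticeably off from 100000 */
    printf("double sum: %f\n", dsum);   /* very close to 100000 */

    return 0;
}
```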
Now, everything I've said so far assumes that your FPU does support, in hardware, the types you care about. If a floating-point type isn't supported in hardware and has to be emulated in software, it's obviously going to be slower, often much slower. There are at least three areas where this effect manifests itself:
- If you're using a microcontroller with no FPU at all, it's common for all floating point to be implemented in software, and to be painfully slow. (I think it's also common for double precision to be even slower, meaning that `float` may be advantageous there.)
- If you're using a nonstandard or less-than-standard type that, for that reason, is implemented in software, it's obviously going to be slower. In particular, the FPUs I'm familiar with don't support a half-precision (16-bit) floating-point type, so yes, it wouldn't be surprising if it were significantly slower than regular `float` or `double` (see the sketch after this list).
- Some GPUs have good support for single or half precision, but poor or no support for double.
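For the half-precision case, here's a hedged sketch. It uses `_Float16`, a compiler extension (an optional type in C23) that GCC and Clang provide only on some targets; the `__FLT16_MANT_DIG__` guard is my assumption about how those compilers advertise it. Where the hardware has no native half-precision arithmetic, each `_Float16` operation is typically emulated by converting to `float`, operating, and converting back, so the half-precision loop tends to come out slower than the plain `float` one.

```c
/*
 * Hedged sketch: time the same loop in float and in _Float16.  _Float16
 * availability and the __FLT16_MANT_DIG__ detection macro are assumptions
 * about specific compilers (GCC/Clang) and targets.  Without native
 * half-precision hardware, each _Float16 operation is typically emulated,
 * so the second loop is usually slower.
 */
#include <stdio.h>
#include <time.h>

#define N 50000000L

int main(void)
{
#if defined(__FLT16_MANT_DIG__)
    clock_t t0, t1;

    volatile float fs = 0.0f;       /* volatile forces each iteration to be performed */
    t0 = clock();
    for (long i = 0; i < N; i++)
        fs = fs * 0.5f + 1.0f;
    t1 = clock();
    printf("float:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    volatile _Float16 hs = (_Float16)0.0f;
    t0 = clock();
    for (long i = 0; i < N; i++)
        hs = hs * (_Float16)0.5f + (_Float16)1.0f;
    t1 = clock();
    printf("_Float16: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
#else
    printf("this compiler/target doesn't provide _Float16\n");
#endif
    return 0;
}
```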