
While testing the performance of floats in .NET, I stumbled onto a weird case: for certain values, multiplication seems way slower than normal. Here is the test case:

using System;
using System.Diagnostics;

namespace NumericPerfTestCSharp {
    class Program {
        static void Main() {
            Benchmark(() => float32Multiply(0.1f), "\nfloat32Multiply(0.1f)");
            Benchmark(() => float32Multiply(0.9f), "\nfloat32Multiply(0.9f)");
            Benchmark(() => float32Multiply(0.99f), "\nfloat32Multiply(0.99f)");
            Benchmark(() => float32Multiply(0.999f), "\nfloat32Multiply(0.999f)");
            Benchmark(() => float32Multiply(1f), "\nfloat32Multiply(1f)");
        }

        static void float32Multiply(float param) {
            float n = 1000f;
            for (int i = 0; i < 1000000; ++i) {
                n = n * param;
            }
            // Write result to prevent the compiler from optimizing the entire method away
            Console.Write(n);
        }

        static void Benchmark(Action func, string message) {
            // warm-up call
            func();

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 5; ++i) {
                func();
            }
            Console.WriteLine(message + " : {0} ms", sw.ElapsedMilliseconds);
        }
    }
}

Results:

float32Multiply(0.1f) : 7 ms
float32Multiply(0.9f) : 946 ms
float32Multiply(0.99f) : 8 ms
float32Multiply(0.999f) : 7 ms
float32Multiply(1f) : 7 ms

Why are the results so different for param = 0.9f?

Test parameters: .NET 4.5, Release build, code optimizations ON, x86, no debugger attached.
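For reference, here is a minimal sketch (separate from the benchmark) that prints where the loop actually ends up for a few of the parameters, and whether the final value lies in the subnormal range below 2^-126. The MinNormal constant is my addition: the smallest normal float.

```csharp
using System;

class WhereDoesNGo {
    const float MinNormal = 1.17549435e-38f; // 2^-126, smallest normal float

    static void Main() {
        foreach (float param in new[] { 0.1f, 0.9f, 0.99f }) {
            float n = 1000f;
            for (int i = 0; i < 1000000; ++i) {
                n = n * param;
            }
            // Subnormal: nonzero but below the smallest normal float
            bool subnormal = n != 0f && n < MinNormal;
            Console.WriteLine("param = {0}: n = {1}, subnormal = {2}", param, n, subnormal);
        }
    }
}
```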

Asik
  • Repeating this for all numbers from 0.0 to 1.0 in steps of 0.01, I see a slowdown of factor 50 for values between 0.5 and 1.0, both exclusive, and only with the debugger attached on .NET 4.0, release build, x86, optimized. – Daniel Brückner Dec 20 '12 at 03:09
  • And for the slow values the final value of `n` is `1.401298E-45` while it is `0` for the fast values. Note that I changed the number of iterations in `float32Multiply()`. – Daniel Brückner Dec 20 '12 at 03:19
  • Now I got the slowdown also without debugger attached, but the slowdown faded between `0.8` and `0.9` instead of abruptly disappearing just below `1.0`. Strange things. – Daniel Brückner Dec 20 '12 at 03:26
  • IEEE-754 single-precision floating-point numbers less than 2**-126 (1.1754944e-38) in magnitude are denormal numbers. In many cases, floating-point units handle such denormal numbers quite slowly. The slowdown could easily be a factor of 100. If the computation is mapped to SSE there are DAZ (denormals are zero) and FTZ (flush to zero) modes which treat these small numbers as zero on input and output respectively, avoiding the slowdown. Your compiler may have a switch to turn those modes on. Usually compiler flags turn both DAZ and FTZ on/off together. – njuffa Dec 20 '12 at 03:45
  • @njuffa... Yep, I think that's on the money. I hit this doing audio processing in c#. I don't think there are compiler options for this. I think the JIT only uses x87, not SSE. – spender Dec 20 '12 at 03:52
  • Let me guess, you are using an IIR filter? There is no FTZ mode in x87, but one possibility is to flush operands to zero in the code when their magnitude approaches 2**-126. This will create some amount of overhead, but it may still result in better performance than when the floating-point unit handles denormals. – njuffa Dec 20 '12 at 05:14
  • 1
    @spender Late reply, but wanted to answer: the 32-bit JIT uses x87 and the 64-bit JIT uses SSE. Mono behaves the same way. – Asik Sep 13 '13 at 14:22

2 Answers


As others have mentioned, various processors do not support normal-speed calculations when subnormal floating-point values are involved. This is either a design defect (if the behavior impairs your application or is otherwise troublesome) or a feature (if you prefer a cheaper processor, or the alternative use of silicon enabled by not spending gates on this work).

It is illuminating to understand why there is a transition at .5:

Suppose you are multiplying by p. Eventually, the value becomes so small that the result is some subnormal value (below 2^-126 in 32-bit IEEE binary floating point). Then multiplication becomes slow. As you continue multiplying, the value continues decreasing, and it reaches 2^-149, which is the smallest positive number that can be represented. Now, when you multiply by p, the exact result is of course 2^-149 · p, which is between 0 and 2^-149, which are the two nearest representable values. The machine must round the result and return one of these two values.

Which one? If p is less than ½, then 2^-149 · p is closer to 0 than to 2^-149, so the machine returns 0. Then you are not working with subnormal values anymore, and multiplication is fast again. If p is greater than ½, then 2^-149 · p is closer to 2^-149 than to 0, so the machine returns 2^-149, and you continue working with subnormal values, and multiplication remains slow. If p is exactly ½, the rounding rules say to use the value that has zero in the low bit of its significand (fraction portion), which is zero (2^-149 has 1 in its low bit).
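This ½ threshold can be demonstrated directly with float.Epsilon, which is exactly the 2^-149 value discussed above. A small sketch, assuming the default round-to-nearest-even mode and single-precision evaluation:

```csharp
using System;

class RoundingAtHalf {
    static void Main() {
        float tiny = float.Epsilon; // 2^-149, smallest positive float
        Console.WriteLine(tiny * 0.4f == 0f);   // closer to 0, so rounds down to 0
        Console.WriteLine(tiny * 0.6f == tiny); // closer to 2^-149, stays subnormal
        Console.WriteLine(tiny * 0.5f == 0f);   // exact tie, rounds to even (0)
    }
}
```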

You report that .99f appears fast. That case should also end up exhibiting the slow behavior. Perhaps the code you posted is not exactly the code for which you measured fast performance with .99f? Perhaps the starting value or the number of iterations was changed?

There are ways to work around this problem. One is that the hardware has mode settings that treat any subnormal value used or produced as zero, called "denormals are zero" and "flush to zero" modes. I do not use .NET and cannot advise you about how to set these modes in .NET.

Another approach is to add a tiny value each time, such as

n = (n+e) * param;

where e is at least 2^-126/param. Note that 2^-126/param should be calculated rounded upward, unless you can guarantee that n is large enough that (n+e) * param does not produce a subnormal value. This also presumes n is not negative. The effect of this is to make sure the calculated value is always large enough to be in the normal range, never subnormal.

Adding e in this way of course changes the results. However, if you are, for example, processing audio with some echo effect (or other filter), then the value of e is too small to cause any effects observable by humans listening to the audio. It is likely too small to cause any change in the hardware behavior when producing the audio.
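Applied to the loop from the question, this workaround might look like the following sketch. The MinNormal constant and the 1.001f safety factor are my additions, standing in for "calculate 2^-126/param rounded upward":

```csharp
using System;

class AddTinyValue {
    const float MinNormal = 1.17549435e-38f; // 2^-126, smallest normal float

    public static float MultiplyLoop(float param, int iterations) {
        // e >= 2^-126 / param guarantees (n + e) * param >= 2^-126,
        // so the running value never enters the subnormal range.
        float e = (MinNormal / param) * 1.001f;
        float n = 1000f;
        for (int i = 0; i < iterations; ++i) {
            n = (n + e) * param;
        }
        return n;
    }

    static void Main() {
        // Settles near e * param / (1 - param), a small but normal value
        Console.WriteLine(MultiplyLoop(0.9f, 1000000));
    }
}
```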

Eric Postpischil
  • Great answer, @Eric... I ran the supplied code and confirm that the .99 measurement reported is correct... I wonder what else could be at play to cause this inconsistency? +1 – spender Dec 20 '12 at 19:30
  • @spender: Well, one possibility is the implementation is using 80-bit floating point to do the arithmetic, so it has a much larger exponent range than 32-bit floats. Then the .99 case might never reach the 80-bit subnormal range. Increasing the number of iterations might reveal that. – Eric Postpischil Dec 20 '12 at 19:59

I suspect this has something to do with denormal values (floating-point values smaller than roughly 1e-38) and the cost associated with processing them.

If you test for denormal values and remove them, sanity is restored.

    static void float32Multiply(float param) {
        float n = 1000f;
        for (int i = 0; i < 1000000; ++i) {
            n = n * param;
            if (n < 1e-38) n = 0; // flush near-denormal values to zero
        }
        // Write result to prevent the compiler from optimizing the entire method away
        Console.Write(n);
    }
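On newer runtimes the threshold test can be written without the magic number: .NET Core 3.0 added float.IsSubnormal (not available in the .NET 4.5 the question targets), so a variant of the same flush-in-software idea might look like:

```csharp
using System;

class SoftwareFlush {
    static void Main() {
        float n = 1000f;
        for (int i = 0; i < 1000000; ++i) {
            n *= 0.9f;
            if (float.IsSubnormal(n)) n = 0f; // flush to zero before the next multiply
        }
        Console.WriteLine(n); // the first subnormal product is flushed, so n ends at 0
    }
}
```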
spender