46

I've often noticed gcc converting multiplications into shifts in the executable. Something similar might happen when multiplying an `int` and a `float`. For example, `2 * f` might simply increment the exponent of `f` by 1, saving some cycles. Do compilers, perhaps if one requests it (e.g. via `-ffast-math`), generally do this?

Are compilers generally smart enough to do this, or do I need to do this myself using the scalb*() or ldexp()/frexp() function family?
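
To make the comparison concrete, here is a minimal sketch of the two alternatives I mean (standard C99 `<math.h>`; the function and variable names are just for illustration):

    #include <math.h>

    void example(float f)
    {
        float a = 2.0f * f;      /* plain multiply                       */
        float b = ldexpf(f, 1);  /* scale by 2^1, i.e. bump the exponent */
        (void)a; (void)b;
    }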

user1095108
  • 14,119
  • 9
  • 58
  • 116
  • 10
    What is your question? – Code-Apprentice Oct 16 '12 at 16:24
  • 3
    The question is why doesn't the compiler convert floating-point multiplies by 2 to exponent increments like it can for integers and shifts. – Mysticial Oct 16 '12 at 16:25
  • 1
    I know that it can, but do compilers in general do it? Are they smart enough already? Or do I need to do it myself? – user1095108 Oct 16 '12 at 16:26
  • I suppose I need to do an `int * float` vs `ldexp()/frexp()` vs `scalbnf()` benchmark. It'd be nicer if someone had done this already. – user1095108 Oct 16 '12 at 17:06
  • I think the conclusion we can come to is that it's too much of a corner case for compiler writers to bother with - especially since such hardware is almost non-existent nowadays. So if you think you can do better than the compiler, by all means do it. Beating the compiler is not impossible. Being a low-level HPC programmer myself, I beat the compiler all the time. – Mysticial Oct 16 '12 at 17:11
  • 1
    Maybe this requirement falls into the category where the people smart enough to want it are better suited to implement it themselves than to explain to the lesser souls at Stack Overflow what is happening... – Aki Suihkonen Oct 16 '12 at 17:57
  • 4
    @user1095108: I have done a benchmark of ldexp/increment/frexp. At least in my testing (VC++ and gcc, on x86), a multiplication was *much* faster. – Jerry Coffin Oct 16 '12 at 18:29
  • 3
    [my test also show multiplication being about 134x faster than `scalbln`.](http://ideone.com/hekZJ) – Mooing Duck Oct 16 '12 at 18:39

9 Answers

85

For example, 2 * f, might simply increment the exponent of f by 1, saving some cycles.

This simply isn't true.

First, you have too many corner cases, such as zero, infinity, NaN, and denormals. Then you have the performance issue.

The misunderstanding is in assuming that incrementing the exponent is faster than doing a multiplication. It isn't.

If you look at the hardware instructions, there is no direct way to increment the exponent. So what you need to do instead is (see the sketch after this list):

  1. Bitwise convert into integer.
  2. Increment the exponent.
  3. Bitwise convert back to floating-point.
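
As a rough illustration (my own sketch, not from the original answer), those three steps look like this in C. Note that it is correct only for normal, finite, nonzero values; zero, denormals, infinities, and NaNs would all be mishandled:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical "multiply by 2 via exponent increment" for a float.
       Correct ONLY for normal, finite, nonzero inputs and outputs. */
    static float times_two_by_exponent(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* 1. bitwise convert to integer  */
        bits += UINT32_C(1) << 23;       /* 2. add 1 to the exponent field */
        memcpy(&f, &bits, sizeof f);     /* 3. bitwise convert back        */
        return f;
    }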

There is generally a medium to large latency for moving data between the integer and floating-point execution units. So in the end, this "optimization" becomes much worse than a simple floating-point multiply.

So the reason why the compiler doesn't do this "optimization" is because it isn't any faster.

Mysticial
  • 464,885
  • 45
  • 335
  • 332
  • Hardware instructions of what CPU? There might exist one that has such an instruction. – user1095108 Oct 16 '12 at 16:35
  • 4
    I'm assuming x86 since you didn't specify. But it isn't too different for other architectures. – Mysticial Oct 16 '12 at 16:36
  • Ok, but bitwise convert into integer? Exponents are integers, no conversion is necessary. – user1095108 Oct 16 '12 at 16:37
  • 2
    Yes, a conversion is necessary. In C/C++ you would do it using type-punning via union. This "conversion" isn't cheap because of the way modern processors are designed. (Especially with separate integer and floating-point units.) – Mysticial Oct 16 '12 at 16:39
  • The FPU or CPU needs to do the addition/subtraction of exponents when multiplying/dividing. This is built into the hardware, I assume. The only problem is how to tell it to do it. – user1095108 Oct 16 '12 at 16:41
  • 4
    @user1095108 It may *have* the hardware, but it's not accessible (by itself) via an instruction. Think of it as a private function of a class. – Mysticial Oct 16 '12 at 16:43
  • It's not even necessary to separate the exponent. Just add 1<<23 (for floats; 1ULL<<52 for doubles). And many architectures have instructions to mov fp to gpr and vice versa, but on many of them there is a huge penalty for doing so... – Aki Suihkonen Oct 16 '12 at 16:50
  • 2
    @AkiSuihkonen Yes, hence why I omitted the masking/shifting steps from that 3-step process. The expensive part is the bitwise conversions themselves (type-punning). – Mysticial Oct 16 '12 at 16:51
  • Please see http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html: the `scalb*` functions are built-in functions of `gcc`. This usually means fast implementations. – user1095108 Oct 16 '12 at 16:53
  • 8
    Built-in or not, have you actually looked at the instructions that they generate? Have you even benchmarked them? It doesn't matter what the compiler does, it still has to respect the ISA. – Mysticial Oct 16 '12 at 16:55
  • “zero, ..., and denormals” I hate to be that person, but I **have** to point out that each zero is just a denormal like the 2^23-1 others of the same sign. :) – Pascal Cuoq Oct 16 '12 at 17:42
  • Mr. Pascal: Could division make an exception? 0/0 == NaN instead of Inf? – Aki Suihkonen Oct 16 '12 at 18:06
  • @AkiSuihkonen If someone said that all denormals had to produce the same results in all circumstances, it was not me. What I said was that, in a hypothetical implementation of a `*2` operation as an exponent increment, the special handling for denormals would also handle the case "zero" without additional logic. I stand by this statement. – Pascal Cuoq Oct 16 '12 at 18:16
  • Downvoter care to comment? If there's anything that can be fixed, I'm more than willing to do so. – Mysticial Oct 24 '12 at 17:16
  • 2
    @Mysticial: I think it would be a lot clearer if you wrote "copy from floating-point stack (to RAM) to general-purpose accumulator" (and back) instead of "bitwise convert". In nearly all cases bitwise conversion is free, because it means "treat this general purpose accumulator as if it holds a different type" -- but in this case it isn't. – Ben Voigt Apr 08 '16 at 03:49
  • @BenVoigt Ah. I used this original wording because it applies both to the FPU and the SIMD registers. Only with the FPU do you need to go through the stack. On the SIMD registers you can just reinterpret with `_mm_castxx` but you'll still get hit with bypass delays. So still a penalty, but different source. – Mysticial Apr 08 '16 at 04:03
  • The corner cases are the problem (especially 0.0), but it *is* a single instruction on x86 using SSE2 for scalar or SIMD floating point; `paddq` (64-bit) or `paddd` (32-bit) with a constant to add an integer to the FP bit-pattern. At worst you'll get some extra bypass-forwarding latency if using this between other FP math instructions. – Peter Cordes Oct 03 '22 at 21:00
24

On modern CPUs, multiplication typically has one-per-cycle throughput and low latency. If the value is already in a floating-point register, there's no way you'll beat that by juggling it around to do integer arithmetic on the representation. If it's in memory to begin with, and if you're assuming neither the current value nor the correct result would be zero, denormal, NaN, or infinity, then it might be faster to perform something like

addl $0x100000, 4(%eax)   # x86: add 1 to the exponent field of the double at (%eax)

to multiply by two; the only time I could see this being beneficial is if you're operating on a whole array of floating-point data that's bounded away from zero and infinity, and scaling by a power of two is the only operation you'll be performing (so you don't have any existing reason to be loading the data into floating point registers).
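
As a hedged sketch of that array case (my own illustration; the function name is made up), here is the same trick in portable C, valid only while every element and every result stays a normal, finite, nonzero number:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Scale each element of a[] by 2 by adding 1 to its exponent field.
       Correct ONLY while inputs and outputs are normal, finite, nonzero. */
    static void double_all(double *a, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint64_t bits;
            memcpy(&bits, &a[i], sizeof bits);
            bits += UINT64_C(1) << 52;   /* exponent field is bits 52..62 */
            memcpy(&a[i], &bits, sizeof bits);
        }
    }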

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • +1, in which case, you could just add up all those scaling factors and do it only once at the end. – Mysticial Oct 16 '12 at 17:51
  • 2
    +1 for giving a possible case when it might actually be faster (even though I doubt it will ever be worth the trouble of doing) – Leo Oct 17 '12 at 08:43
  • 2
    If you're working with many-channel, high-sample-rate floating-point audio on a low-end/embedded system, and mostly just passing it thru, this kind of optimization could be useful. – R.. GitHub STOP HELPING ICE Oct 17 '12 at 13:09
22

Common floating-point formats, particularly IEEE 754, do not store the exponent as a simple integer, and treating it as an integer will not produce correct results.

In 32-bit float or 64-bit double, the exponent field is 8 or 11 bits, respectively. The exponent codes 1 to 254 (in float) or 1 to 2046 (in double) do act like integers: If you add one to one of these values and the result is one of these values, then the represented value doubles. However, adding one fails in these situations:

  • The initial value is 0 or subnormal. In this case, the exponent field starts at zero, and adding one to it adds 2^-126 (in float) or 2^-1022 (in double) to the number; it does not double the number.
  • The initial value is at least 2^127 (in float) or 2^1023 (in double). In this case, the exponent field starts at 254 or 2046, and adding one to it turns the number into an infinity (if the significand field is zero) or a NaN (otherwise); it does not double the number.
  • The initial value is infinity or a NaN. In this case, the exponent field starts at 255 or 2047, and adding one to it wraps the field to zero (the carry flips the sign bit). The result is zero or a subnormal but should be infinity or a NaN, respectively.

(The above is for positive signs. The situation is symmetric with negative signs.)

As others have noted, some processors do not have facilities for manipulating the bits of floating-point values quickly. Even on those that do, the exponent field is not isolated from the other bits, so you typically cannot add one to it without overflowing into the sign bit in the last case above.

Although some applications can tolerate shortcuts such as neglecting subnormals or NaNs or even infinities, it is rare that applications can ignore zero. Since adding one to the exponent fails to handle zero properly, it is not usable.
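
A quick demonstration of the zero case (my own sketch, not part of the original answer): adding one to the exponent field of +0.0f produces 2^-126, the smallest normal float, rather than zero:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        float f = 0.0f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        bits += UINT32_C(1) << 23;     /* "add one to the exponent field" */
        memcpy(&f, &bits, sizeof f);
        printf("%g\n", f);             /* prints 1.17549e-38, not 0 */
        return 0;
    }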

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312
10

It's not about compilers or compiler writers not being smart. It's more about obeying standards and producing all the necessary "side effects" such as Infs, NaNs, and denormals.

It can also be about not producing side effects that are not called for, such as reading memory. But I do recognize that it can be faster in some circumstances.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57
  • Maybe in your domain. I don't care about NaNs. Don't forget `-ffast-math`, either. – user1095108 Oct 16 '12 at 16:28
  • 5
    @user1095108 But how is the compiler supposed to know that you don't care? – Mysticial Oct 16 '12 at 16:28
  • @user1095108 - wow, what do you mean, that you don't care about NaNs? Do you care about the ISO standard? – Kiril Kirov Oct 16 '12 at 16:28
  • 1
    There could be methods to tell the compiler how much one cares... #pragmas or __attributes__. OTOH, those requirements are already relaxed in some domains, such as OpenGL shader languages. – Aki Suihkonen Oct 16 '12 at 16:31
3

Actually, this is what happens in the hardware.

The 2 is also passed into the FPU as a floating-point number, with a mantissa of 1.0 and an exponent of 1 (i.e. 2^1). For the multiplication, the exponents are added and the mantissas multiplied.

Given that there is dedicated hardware to handle the complex case (multiplying with values that are not powers of two), and the special case is not handled any worse than it would be using dedicated hardware, there is no point in having additional circuitry and instructions.

Simon Richter
  • 28,572
  • 1
  • 42
  • 64
2

Here's an actual compiler optimization I'm seeing with GCC 10:

x = 2.0 * hi * lo;

Generates this code:

mulsd   %xmm1, %xmm0      # x = hi * lo;
addsd   %xmm0, %xmm0      # x += x;
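
My reconstruction (not from the answer) of a function that reproduces this with `gcc -O2`:

    double scale(double hi, double lo)
    {
        double x = 2.0 * hi * lo;   /* one mulsd, then addsd %xmm0, %xmm0 */
        return x;
    }

GCC can do this even without `-ffast-math`, because `x * 2.0` and `x + x` yield bit-identical IEEE 754 results, so the rewrite is exact.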
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
1

If you think that multiplying by two means increasing the exponent by 1, think again. Here are the possible cases for IEEE 754 floating-point arithmetic:

Case 1: Infinity and NaN stay unchanged.

Case 2: Floating-point numbers with the largest possible exponent are changed to Infinity by increasing the exponent and setting the mantissa to zero (the sign bit is unchanged).

Case 3: Normalised floating-point numbers with exponent less than the maximum possible exponent have their exponent increased by one. Yippee!!!

Case 4: Denormalised floating-point numbers with the highest mantissa bit set have their mantissa shifted to the left by one bit position and their exponent increased by one, turning them into normalised numbers.

Case 5: Denormalised floating-point numbers with the highest mantissa bit cleared, including +0 and -0, have their mantissa shifted to the left by one bit position, leaving the exponent unchanged.

I very much doubt that a compiler producing integer code that handles all these cases correctly will be anywhere near as fast as the floating-point hardware built into the processor. And it's only suitable for multiplication by 2.0. For multiplication by 4.0 or 0.5, a whole new set of rules applies. And for the case of multiplication by 2.0, you might try to replace x * 2.0 with x + x, and many compilers do this. They do it because a processor might, for example, be able to do one addition and one multiplication at the same time, but not two of the same kind. So sometimes you would prefer x * 2.0 and sometimes x + x, depending on what other operations need doing at the same time.
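
To illustrate (my own sketch): both forms below are IEEE-correct in all five cases above, and for general power-of-two factors the standard `ldexp` handles every case as well:

    #include <math.h>

    double twice_mul(double x) { return x * 2.0; }      /* one multiply                  */
    double twice_add(double x) { return x + x; }        /* one add, identical value      */
    double quarter(double x)   { return ldexp(x, -2); } /* x * 2^-2, all cases handled   */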

gnasher729
  • 51,477
  • 5
  • 75
  • 98
0

It may be useful for embedded systems compilers to have a special scale-by-power-of-two pseudo-op which could be translated by the code generator in whatever fashion was optimal for the machine in question, since on some embedded processors focusing on the exponent may be an order of magnitude faster than doing a full power-of-two multiplication. On the embedded micros where multiplication is slowest, however, a compiler could probably achieve a bigger performance boost by having the floating-point-multiply routine check its arguments at run time so as to skip over parts of the mantissa that are zero.

supercat
  • 77,689
  • 9
  • 166
  • 211
0

A previous Stack Overflow question about multiplication by powers of 2 covers this ground. The consensus, and the actual implementations, proved that unfortunately there is no current way to be more efficient than standard multiplication.

Community
  • 1
  • 1
B. Decoster
  • 7,723
  • 1
  • 32
  • 52
  • What you say may be true for the current x86 architectures, but for others? – user1095108 Nov 14 '12 at 10:22
  • Well, the only other architecture tested was x87, and it was a **disaster** – B. Decoster Nov 14 '12 at 12:57
  • @user1095108: The optimal approach would probably be to have the compiler emit an intermediate-form instruction which would be recognizable as "add/subtract/multiply/divide by constant"; in the event that the compiled code is targeted for a processor without an FPU, there are many cases where operations with constants could be implemented more efficiently than a normal floating-point multiply; indeed, in the days when compilers supported "x87-style" math that was even more true than today [an under-appreciated feature of extended-double and extended-float is that... – supercat Feb 24 '15 at 00:32
  • ...in many cases machines without an FPU can perform sequences of operations on extended-precision types *faster* than on `float` and `double`. Evaluating an expression like `16000003.0f-16000002.0f+16000000.0f` purely in `float` would require putting the intermediate result through many iterations of a shift-and-test loop to normalize it, even though it would then have to be shifted back in preparation for the next addition. Storing a 32-bit mantissa *without an implied '1'* in a register separate from the exponent would avoid such difficulties. – supercat Feb 24 '15 at 00:42