9

Which version is faster: x * 0.5 or x / 2?

Some time ago I took a university course called Computer Systems. From back then I remember that multiplying two values can be achieved with comparatively "simple" logic gates, but division is not a "native" operation: it requires a sum register that is repeatedly increased by the divisor in a loop and compared to the dividend.

Now I have to optimise an algorithm with a lot of divisions. Unfortunately it's not just dividing by two, so binary shifting is not an option. Will it make a difference to change all divisions to multiplications?

Update:

I have changed my code and didn't notice any difference. You're probably right about compiler optimisations. Since all the answers were great I've upvoted them all. I chose Rahul's answer because of the great link.

Paul R
lhk
  • well if you do a loop of 1mil operations and time it, I think you can get your answer that way :D – Lefteris Oct 19 '12 at 15:05
  • floating point multiplications and divisions are probably about equally fast. I suspect that for integers, multiplication is significantly faster. Also, integer operations tend to be faster than FP ones. In other words, iMult < iDiv < fpMult = fpDiv (WRT time) – Wug Oct 19 '12 at 15:09
  • @Wug: no, this is not true - *throughput* may be the same on modern CPUs but typically *latency* for division is much higher than for multiplication. – Paul R Oct 19 '12 at 15:10
  • Are you dividing by a constant or dividing by a variable? You didn't explicitly say. – Chris A. Oct 19 '12 at 15:17
  • @Wug, FP multiplication takes 5 cycles on a modern Sandy Bridge processor, FP division takes 10 to 14 cycles for scalar SSE division and up to 29 cycles for vector AVX division. It also takes between 10 and 24 cycles to perform division in the x87 unit. – Hristo Iliev Oct 19 '12 at 15:26
  • What about integer operations? All one cycle? I suspect multiplication is one cycle, but is division? – Wug Oct 19 '12 at 17:25
  • On Sandy Bridge, depending on the instruction variant, for integer multiply the latency is 3 - 4 cycles, throughput is 1 - 2 cycles, whereas for integer division it's 20 - 103 cycles latency, 11 - 84 cycles throughput (the high end of the range is for 64 bit integer division, but even for 32 bit the numbers are still an order of magnitude greater than for multiplication). See Agner Fog's site for detailed info. – Paul R Oct 19 '12 at 18:46
  • See also a more asm oriented version of this question: http://stackoverflow.com/questions/38977240/reading-assembly-comparing-multiplication-vs-division – Peter Cordes Aug 16 '16 at 20:25
  • Related: [Floating point division vs floating point multiplication](https://stackoverflow.com/a/45899202) – Peter Cordes Feb 25 '22 at 11:08

5 Answers

11

Usually division is a lot more expensive than multiplication, but a smart compiler will often convert division by a compile-time constant to a multiplication anyway. If your compiler is not smart enough though, or if there are floating point accuracy issues, then you can always do the optimisation explicitly, e.g. change:

 float x = y / 2.5f;

to:

 const float k = 1.0f / 2.5f;

 ...

 float x = y * k;
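
The same idea can be applied by hand when the divisor is only known at run time but is reused for many values, e.g. inside a loop. A minimal sketch (the `scale_all` helper here is just illustrative, and note this can change the rounding of each result slightly):

 /* Divide many values by the same run-time divisor: compute the
    reciprocal once, then multiply, so only one division is performed. */
 void scale_all(float *v, int n, float divisor)
 {
     const float inv = 1.0f / divisor;
     for (int i = 0; i < n; i++)
         v[i] *= inv;
 }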

Note that this is most likely a case of premature optimisation - you should only do this kind of thing if you have profiled your code and positively identified division as being a performance bottleneck.

Paul R
  • I'd say that if you've found that a loop containing a division is taking a lot of time, this kind of optimization is one of the first things you should try. Compilers aren't generally allowed to do this for you without `-ffast-math`. It might happen without -ffast-math for divide by two, because `0.5f` is exactly representable, but non-constant divisors (and most constants) definitely can't do this without changing the result. – Peter Cordes Aug 09 '16 at 19:43
  • BTW, yes, gcc does do it for powers of 2. https://godbolt.org/g/N42LEK – Peter Cordes Aug 09 '16 at 19:50
  • This also depends on the constants and on guarantees of accuracy. In the mathematical sense, you can convert x / 10.0 to x * 0.1. In binary floating-point arithmetic, you can lose accuracy, because 0.1 is a periodic number in binary form, so it needs to be rounded. This does not happen with the number 10. You can see an example here: https://onlinegdb.com/rJ56WrvOS In some cases, further rounding resolves the gap; in other cases, it doesn't. Of course, the compiler can do it in some cases, like division by 2. But in some cases, you have to do it manually at a potential cost of accuracy (if acceptable). – v6ak Oct 06 '19 at 10:36
  • I know it's a very old question but what about division by one? Is the computer intelligent enough to understand no operation has to be done in that case? – Getter Jun 30 '22 at 14:23
  • @Getter: yes, if the divisor is a compile-time constant equal to 1 then [the division becomes a no-op](https://godbolt.org/z/Ks34PdhvM) – Paul R Jun 30 '22 at 17:21
  • Sorry, I should've been more precise. I meant in the specific case that the divisor variable was 1. Imagine you do 100 divisions, of which 50 are with divisor = 1. Are those 50 divisions faster than the other 50 done with a divisor ≠ 1? – Getter Jul 01 '22 at 06:14
  • @Getter: no, the overhead of testing for a 1 divisor on every iteration would not be cost effective for the general case. If you expect the divisor to be 1 quite frequently then you should test for this explicitly in your code. Note however that some CPU architectures have variable latency divide instructions so you may see some benefit on certain CPUs when the divide results in an “early out”. – Paul R Jul 01 '22 at 14:33
8

Division by a compile-time constant that's a power of 2 is quite fast (comparable to multiplication by a compile-time constant) for both integers and floats (it's basically convertible into a bit shift).

For floats, even dynamic division by powers of two is much faster than regular (dynamic or static) division, as it basically turns into a subtraction on the exponent.
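
As a rough illustration of both points (this sketch is separate from the benchmark below; `ldexpf` from `<math.h>` scales a float by a power of two by adjusting its exponent):

#include <math.h>
#include <stdio.h>

int main(void)
{
    unsigned long n = 1000UL;
    float f = 12.5f;

    unsigned long a = n / 8;        /* power-of-two constant: compilers typically emit n >> 3 */
    unsigned long b = n >> 3;       /* equivalent for unsigned operands */

    float c = f / 8.0f;             /* power-of-two float division */
    float d = ldexpf(f, -3);        /* same value via an exponent adjustment: f * 2^-3 */

    printf("%lu %lu %f %f\n", a, b, c, d);
    return 0;
}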

In all other cases, division appears to be several times slower than multiplication.

For a dynamic divisor the slowdown factor on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz appears to be about 8; for static ones it is about 2.

The results are from a little benchmark of mine, which I made because I was somewhat curious about this (notice the aberrations at powers of two):

[two benchmark plots: operation time vs. operand value for ulong multiplication and division, with static (0) and dynamic (1) arguments]

  • ulong -- 64-bit unsigned
  • 1 in the label means a dynamic argument
  • 0 in the label means a statically known argument

The results were generated from the following template (a C program whose $-variables were filled in by a bash script):

#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
int main(int argc, char** argv){
    $TYPE arg = atoi(argv[1]);
    $TYPE i = 0, res = 0;
    for (i=0;i< $IT;i++)
        res+=i $OP $ARG;
    printf($FMT, res);
    return 0;
}

with the $-variables assigned and the resulting program compiled with -O3 and run (dynamic values came from the command line, as is obvious from the C code).
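
For illustration, the dynamic 64-bit division case might be instantiated roughly like this (the iteration count and format string below are just example values, not necessarily the ones used for the plots):

#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
int main(int argc, char** argv){
    if (argc < 2) return 1;              /* expects the divisor on the command line */
    ulong arg = atoi(argv[1]);           /* dynamic argument ($ARG) */
    ulong i = 0, res = 0;
    for (i = 0; i < 100000000UL; i++)    /* $IT iterations */
        res += i / arg;                  /* $OP is division here */
    printf("%lu\n", res);                /* $FMT */
    return 0;
}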

Petr Skocik
  • FYI, the throughputs and latencies of integer and floating point multiply and add are known to be constant (and not data-dependent) on modern x86 CPUs (including your Intel Nehalem). Division performance is known to be data-dependent (and only partially pipelined). This is an interesting test of which actual data produce throughputs at the lower or upper end of the ranges in [Agner Fog's instruction tables](http://www.agner.org/optimize/). Note that 64-bit division is tremendously slower than 64-bit multiplication, by a bigger factor than with 32-bit. What size is `ulong` in your testing? – Peter Cordes Aug 09 '16 at 19:48
  • @PeterCordes 64 bits (added it to the answer). Thank you for the link. I was actually looking for something like that but couldn't find it so I learned a little bit about gnuplot and generated some easily viewable data myself. – Petr Skocik Aug 09 '16 at 20:11
  • Am I just missing it, or are these graphs missing a key? What do the little colored symbols mean? I guess those four lines of text overlaid on top of the graph are supposed to be the key, but they don't match up to a symbol. – Cody Gray - on strike Nov 23 '16 at 12:17
  • @CodyGray top right corner. (Unfortunately, it kind of blends with the actual graph.) `ulong.0.div => red cross, ulong.1.div => blue asterisk, ...`. 0 -- statically known argument (immediates), 1 -- dynamic argument (as explained in the bullet-point list below the graphs). – Petr Skocik Nov 23 '16 at 12:22
4

That will likely depend on your specific CPU and the types of your arguments. For instance, in your example you're doing a floating-point multiplication but an integer division. (Probably, at least, in most languages I know of that use C syntax.)
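
For example, in C the two expressions from the question do different things when `x` is an `int` (a small sketch, assuming that type):

#include <stdio.h>

int main(void)
{
    int x = 7;
    int    a = x / 2;      /* integer division: a == 3 */
    double b = x * 0.5;    /* x is converted to double: b == 3.5 */
    printf("%d %f\n", a, b);
    return 0;
}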

If you are doing work in assembler, you can look up the specific instructions you are using and see how long they take.

If you are not doing work in assembler, you probably don't need to care. All modern compilers with optimization will change your operations in this way to the most appropriate instructions.

Your big wins on optimization will not be from twiddling the arithmetic like this. Instead, focus on how well you are using your cache. Consider whether there are algorithm changes that might speed things up.

Alan Shutko
  • Re: asm throughput and latency numbers for recent x86 CPUs: [Floating point division vs floating point multiplication](https://stackoverflow.com/a/45899202) FP multiply has better throughput (2/clock) than integer multiply (1/clock), and on Intel CPUs FP division actually has better throughput than integer division (especially 64-bit integer division). – Peter Cordes Feb 25 '22 at 11:11
4

Well, if it is a single calculation you will hardly notice any difference, but if you are talking about millions of operations then division is definitely costlier than multiplication. You can always use whichever is the clearest and most readable.

Please refer to this link: Should I use multiplication or division?

Rahul Tripathi
0

One note to make, if you are looking for numerical stability:

Don't recycle a single division (i.e. a precomputed reciprocal) across solutions that require multiple components/coordinates, e.g. when implementing an n-D vector normalize() function: the following will NOT give you a unit-length vector:

V3d v3d(x,y,z);
float l = v3d.length();
float oneOverL = 1.f / l;
v3d.x *= oneOverL;
v3d.y *= oneOverL;
v3d.z *= oneOverL;
assert(1. == v3d.length()); // fails!

... but this code will:

V3d v3d(x,y,z);
float l = v3d.length();
v3d.x /= l;
v3d.y /= l;
v3d.z /= l;
assert(1. == v3d.length()); // ok!

I guess the problem in the first code excerpt is the additional float normalization (the pre-division imposes a different scale normalization on the floating-point number, which is then forced upon the actual result, introducing additional error).

I didn't look into this for too long, so please share your explanation of why this happens. I tested it with x, y and z being .1f (and with doubles instead of floats).
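
For anyone who wants to try this without the V3d class, here is a minimal standalone sketch of the two variants (plain floats; the actual outcome depends on the inputs and on rounding, as the comments below discuss):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x = 0.1f, y = 0.1f, z = 0.1f;
    float len = sqrtf(x * x + y * y + z * z);

    /* Variant 1: multiply each component by a precomputed reciprocal. */
    float inv = 1.0f / len;
    float ax = x * inv, ay = y * inv, az = z * inv;

    /* Variant 2: divide each component by the length directly. */
    float bx = x / len, by = y / len, bz = z / len;

    float la = sqrtf(ax * ax + ay * ay + az * az);
    float lb = sqrtf(bx * bx + by * by + bz * bz);

    /* Print the renormalised lengths and whether they compare exactly equal to 1.0f. */
    printf("reciprocal: %.9g (== 1? %d)\n", la, la == 1.0f);
    printf("division:   %.9g (== 1? %d)\n", lb, lb == 1.0f);
    return 0;
}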

StarShine
  • `float oneOverL = 1.f / l;` can have rounding error. In fact `v3d.length()` will normally have some rounding error, assuming the square root of the sum of squares can't be exactly represented. But so will all of `v3d.x /= l;` or `v3d.x *= oneOverL;` in most cases. In some special simple cases, avoiding the separate `1. / l` step may happen to make other computations exact, or happen to make the rounding errors add up to zero. – Peter Cordes Feb 25 '22 at 10:00
  • With an extended-precision calculator (`calc` aka apcalc), `sqrt(0.1 ^2 * 3)` comes out to 0.17320508075688772935 exactly. Doing `0.1` over that gives about `~0.57735026918962576452...`. Taking the length of that (`sqrt(x^2 * 3)`) gives 1.00000000000000000002. So it's not automatically flawless, it just apparently happened to work out with `0.1` inputs and rounding to the nearest representable `float` or `double` at each step. – Peter Cordes Feb 25 '22 at 10:04
  • So I don't think you can count on FP rounding producing x,y,z that calculate to a rounded length of *exactly* 1.0 even from this in the general case of different FP inputs, so it's usually not worth making it significantly slower with extra divisions. – Peter Cordes Feb 25 '22 at 10:06
  • If numbers cannot be represented it automatically introduces error, and indeed various numbers will give different outcomes (being equal or not to 1.0 +/- eps). But I guess the latter code is a bit more stable since there's just one operation, no? – StarShine Feb 25 '22 at 10:08
  • I'm not sure "stability" is the right term for it, since you're not getting huge errors at any point. Relative error is probably about the same for most inputs. But yes, doing `1. / l` separately introduces another step with separate rounding error, and that error biases all 3 components in the same direction, either larger or smaller. – Peter Cordes Feb 25 '22 at 10:13
  • BTW, your code isn't comparing with a margin of +- FLT_EPSILON, it's requiring exact equality after rounding. For that `==` to be true, the exact result (before rounding) of the final sqrt in `v3d.length()` has to be within 0.5 ulp of 1.0, i.e. within FLT_EPSILON/2. (Because epsilon is defined as 1 unit in the last place, low bit of mantissa, for a number with magnitude 1.0) – Peter Cordes Feb 25 '22 at 10:16
  • BTW, you could be checking that `1.0 ^ 2 == length ^2`. Since 1.0 ^ 2 is just 1, you'd be checking `1.0 = x^2 + y^2 + z^2`, which would make it even more sensitive. Square root makes numbers closer to 1.0 (e.g. sqrt(0.99) is 0.994987... and sqrt(1.01) is 1.004987) so that makes it easier for error in the sum of squares to disappear. sqrt(x) == 1.0 is true for x near 1.0 even if the low few bits of the mantissa were non-zero, I think. So interestingly, `v3d.length() == 1.0` has a better chance than most of being exactly equal even after inexact calculation to produce the x,y,z components. – Peter Cordes Feb 25 '22 at 10:22
  • Oh, it's not as wide an error range as I was thinking. Only one step up from `1.0f` has a sqrt of exactly 1.0, and the first step down sqrts to itself. https://godbolt.org/z/Y7GWbvrGe is a test using `x = nextafterf(x, INFINITY)`. So the sqrt gives you an extra margin for error in *one* direction, so the way that works for `.1` might be party coincidence. – Peter Cordes Feb 25 '22 at 11:06
  • I like the term party coincidence. – StarShine Feb 25 '22 at 14:29