
Where it's possible to do so, I'm wondering whether it's faster to replace a single multiplication with a bitshift plus an integer division. Say I've got an int k and I want to multiply it by 2.25.

What's faster?

int k = 5;
k *= 2.25;
std::cout << k << std::endl;

or

int k = 5;
k = (k<<1) + (k/4);
std::cout << k << std::endl;

Output

11
11

Both give the same result; a full example is below.
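(The link to the original full example didn't survive, so this is a minimal self-contained sketch along the same lines:)

#include <iostream>

int main()
{
    int a = 5;
    int b = 5;

    a *= 2.25;              // int promoted to double, multiplied, truncated back to int
    b = (b << 1) + (b / 4); // 2*b + b/4 == 2.25*b, in pure integer arithmetic

    std::cout << a << std::endl; // 11
    std::cout << b << std::endl; // 11
    return 0;
}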

MQDuck
  • Is `k` an integer or a float? –  Jun 01 '14 at 21:16
  • "What's faster": there is an easy way to find out... – Marc Glisse Jun 01 '14 at 21:18
  • try checking which one is correct before checking out speed – SomeWittyUsername Jun 01 '14 at 21:19
  • Depending on the quality of the compiler either is faster, although I'd expect `(k<<1) + (k/4)` to win most of the time. But there are many other considerations. – chux - Reinstate Monica Jun 01 '14 at 21:20
  • This really is too vague a question. Answer could depend on compiler, CPU architecture, variable type, and other factors. But since a multiplication and a division are very similar operations, why would one plus another operation be quicker than the operation by itself? – AntonH Jun 01 '14 at 21:21
  • Doesn't it mainly depend on the type of `k`? If `k` is an int, it will be promoted to `float` (or `double`) and multiplied. If it's a float or double, bit shifting is *expensive*. – Jongware Jun 01 '14 at 21:22
  • @Jongware: Plus impossible ;) – Oliver Charlesworth Jun 01 '14 at 21:22
  • @Jongware it's clearly stated that `k` is `int`. – Iwillnotexist Idonotexist Jun 01 '14 at 21:23
  • The first one is more readable, so stick with that until _[profiling](http://en.wikipedia.org/wiki/Profiling_(computer_programming))_ proves that the line needs improvement. – user3386109 Jun 01 '14 at 21:23
  • Also, if you're willing to use bitshift, why not both ways? `k = (k<<1) + (k>>2);`? – AntonH Jun 01 '14 at 21:23
  • @AntonH: that will round the wrong way for `k=2` and `k=3`. (But it's noteworthy converting the float result of a mul back to int will also `floor` it.) – Jongware Jun 01 '14 at 21:25
  • @AntonH On a VLIW architecture I'm working on, addition and shifting can be issued in parallel but not two additions. Thus I'd do it `k = (k+k) + (k>>2)` on that machine. – Iwillnotexist Idonotexist Jun 01 '14 at 21:25
  • @Jongware Since he's mentioning integers in his question, there's loss somewhere anyways. I was just giving an alternative possibility. – AntonH Jun 01 '14 at 21:26
  • @IwillnotexistIdonotexist That's why my first comment (above) said it was much too vague of a question, and said that it would depend, amongst other things, on CPU architecture. But interesting info on VLIW archi. – AntonH Jun 01 '14 at 21:27
  • Since this is highly architecture-dependent, I would let the compiler optimize the code. I am quite sure a compiler is smarter than most humans on modern architectures because of the many constraints from instruction set, pipelining, instruction units etc. – Jens Jun 01 '14 at 21:27
  • These seem to apply to C++ as well: http://stackoverflow.com/questions/1168451/is-shifting-bits-faster-than-multiplying-and-dividing-in-java-net – Csq Jun 01 '14 at 21:28
  • I like how the downvotes have been withdrawn nearly as fast as they came. Now on to actually answering the question. – Iwillnotexist Idonotexist Jun 01 '14 at 21:31
  • @IwillnotexistIdonotexist They've not been withdrawn, the downvotes and upvotes balance out. – AntonH Jun 01 '14 at 21:32
  • When using the second version (`k = (k<<1) + (k/4);`) do not forget to leave a comment explaining what the code does! – Csq Jun 01 '14 at 21:40
  • I have done profiling for what OP is trying to do and have screenshots/code ready to post, but cannot since the question is on-hold. In short, bitwise shifting seems to be about twice as fast as regular multiplication, at least for the case of multiplying by 2.25. However, the time spent shifting and bit-wise multiplying was minuscule compared to, say, function entry, so unless you're doing a LOT of it, it's probably not worth the effort. – Chris Middleton Jun 01 '14 at 22:20
  • UPDATE: So I just tried doing both with 10 million multiplications, this time putting the loops in the functions so the results weren't obscured by all the function entry and leaving... and the results were that each method (regular and bitwise) took ~ 52 milliseconds of time. So at least for relatively large but not gigantic number of calculations, the results are the same. I will try a larger amount and report back. [This is using Instruments on a 2009 Macbook.] – Chris Middleton Jun 01 '14 at 22:28
  • UPDATE #2 (last one): After running it again, this time multiplying the numbers 100 million through 500 million each 10 times with each method, there was no discernible difference. In fact, the bitwise method came out just slightly slower (18728 ms vs 18223 ms), but this is probably because I moved my mouse by mistake during its run. In conclusion, the compiler or processor is probably doing similar optimizations, and even if not, the only tangible benefit of doing bitwise shifts (at least on my computer) is the feeling of self-pity you get when looking at your code months later. – Chris Middleton Jun 01 '14 at 22:43
  • @AmadeusDrZaius +1, precisely what this question needs as an answer. Out of curiosity did you look at the disassembly? I can't imagine a `cvtsi2sd`, a `mulsd` and then a `cvtsd2si` being faster than a `shr` issued concurrently with an `add`/`lea`, followed by an `add`, especially not after looking at Agner Fog's instruction tables. `mulsd` has a latency of 5, and `cvtsi2sd`/`cvtsd2si` a latency of 3 each! – Iwillnotexist Idonotexist Jun 01 '14 at 22:48
  • @IwillnotexistIdonotexist Interestingly, I just did it again, this time with my `bitwisemultiplyloop` function executing first, followed by `regularmultiplyloop`... and the bitwisemultiply still comes out as a little *slower*. About 3% slower in fact. My guess is that this is because the bitwise method requires some shuffling back and forth of variables into the registers (since it's 3 operations), whereas the multiply only requires one operation. As such, using the native multiply seems to be actually *faster* if you have to shift more than once. So number of operations trumps latency here. – Chris Middleton Jun 01 '14 at 22:56
  • I am puzzled how this is on hold for the reason given. You have a couple of issues. Your first example uses double arithmetic. Your second uses integer. I also note that you could have done your division by 4 as k>>2. It is a bad practice to do double operations then store in an integer value without explicit conversion. Unless you are doing a whole lot of operations, there is no need to use an unclear method to save time. – user3344003 Jun 01 '14 at 23:29
  • @user3344003 _'I am puzzled how this is on hold for the reason given.'_ And so are we! I've voted for reopening the question and giving a comprehensible answer, where choosing from these styles might matter or not. – πάντα ῥεῖ Jun 02 '14 at 00:46
  • The issue is that it was voted to close for several reasons. I voted to close due to duplication (see my above link). – Dave S Jun 02 '14 at 01:57
  • This is a nice example, but why are you asking specifically about 2.25? Will there be another question shortly, asking how to do multiplication by 3.25 (one cannot use the right bit shift, but can use `(k >> 1) + k`, for example); or 2.75; or 4.50 (multiply by 3 then shift left?). I strongly feel that in general readability is way more important than these micro-micro-optimizations (you need a comment to explain what's going on in `(k + k) + (k >> 2)`) and @AmadeusDrZaius' excellent test shows that even for a large number of operations there is nothing to be gained here. – CompuChip Jun 02 '14 at 06:56
  • Hmm typed a long comment, but actually this answer from the "Related questions" bar to the right sums it up rather nicely, IMO: http://stackoverflow.com/a/6357114/2920343 – CompuChip Jun 02 '14 at 06:58
  • note that `k<<1` is equivalent to `k*2`, and `k>>2` is equivalent to `k/4` (this is the definition of those operators in the standard)- using the shift notation isn't a magical speed gain; just write the clearest code – M.M Jun 02 '14 at 07:12
  • @DaveS not a duplicate, the other question is comparing *integer* multiplies not floating point. It makes all the difference. – Mark Ransom Jun 02 '14 at 18:25
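A caveat on the `k/4` versus `k>>2` point raised in the comments above: the two coincide only for non-negative `k`. Integer division truncates toward zero (guaranteed since C++11), while an arithmetic right shift rounds toward negative infinity, and right-shifting a negative value is implementation-defined before C++20. A minimal sketch of the difference; the output shown assumes the common arithmetic-shift behavior:

#include <iostream>

int main()
{
    // For non-negative k, k/4 and k>>2 agree; for negative k they differ.
    for (int k : {7, -7})
        std::cout << "k=" << k
                  << "  k/4=" << (k / 4)
                  << "  k>>2=" << (k >> 2) << '\n';
    // Typical output:
    // k=7   k/4=1   k>>2=1
    // k=-7  k/4=-1  k>>2=-2
    return 0;
}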

4 Answers


The first attempt

I defined functions regularmultiply() and bitwisemultiply() as follows:

int regularmultiply(int j)
{
    return j * 2.25;
}

int bitwisemultiply(int k)
{
    return (k << 1) + (k >> 2);
}

Upon profiling with Instruments (in Xcode, on a 2009 MacBook running OS X 10.9.2), it seemed that bitwisemultiply executed about twice as fast as regularmultiply.

(screenshot: Instruments time profile of the two functions)

The assembly code output seemed to confirm this, with bitwisemultiply spending most of its time on register shuffling and function returns, while regularmultiply spent most of its time on the multiplying.

(screenshots: assembly of regularmultiply and bitwisemultiply)

But my trials were too short.

The second attempt

Next, I tried executing both functions with 10 million multiplications, this time putting the loops inside the functions so that function entry and exit wouldn't obscure the numbers. This time, each method took about 52 milliseconds. So at least for a relatively large but not gigantic number of calculations, the two functions take about the same time. This surprised me, so I decided to calculate for longer and with larger numbers.

The third attempt

This time, I only multiplied 100 million through 500 million by 2.25, but the bitwisemultiply actually came out slightly slower than the regularmultiply.

The final attempt

Finally, I switched the order of the two functions, just to see if the growing CPU graph in Instruments was perhaps slowing the second function down. But still, the regularmultiply performed slightly better:

(screenshot: profiling results with the function order swapped)

Here is what the final program looked like:

#include <stdio.h>

void regularmultiplyloop(int j);
void bitwisemultiplyloop(int k);

int main(void)
{
    int j, k;

    j = k = 4;
    bitwisemultiplyloop(k);
    regularmultiplyloop(j);

    return 0;
}

void regularmultiplyloop(int j)
{
    for(int m = 0; m < 10; m++)
    {
        for(int i = 100000000; i < 500000000; i++)
        {
            j = i;
            j *= 2.25;               /* int -> double multiply, truncated back */
        }
        printf("j: %d\n", j);
    }
}

void bitwisemultiplyloop(int k)
{
    for(int m = 0; m < 10; m++)
    {
        for(int i = 100000000; i < 500000000; i++)
        {
            k = i;
            k = (k << 1) + (k >> 2); /* 2*k + k/4 == 2.25*k */
        }
        printf("k: %d\n", k);
    }
}

Conclusion

So what can we say about all this? One thing we can say for certain is that optimizing compilers are better at this than most people. And those optimizations show themselves even more when there are a lot of computations, which is the only time you'd really want to optimize anyway. So unless you're coding your optimizations in assembly, changing multiplication to bit shifting probably won't help much.

It's always good to think about efficiency in your applications, but the gains of micro-efficiency are usually not enough to warrant making your code less readable.

Chris Middleton
  • You may just be testing the compiler's ability to recognize that only the final result from the loop is kept, throwing away the looping altogether. You need to check the assembly output for those cases too. – Mark Ransom Jun 02 '14 at 18:19
  • @MarkRansom If you look at the code, it doesn't multiply the same value repeatedly. It does 100 million through 500 million sequentially and then repeats. – Chris Middleton Jun 02 '14 at 18:30
  • @MarkRansom The performance that was measured is in the expected range for the loops being present. If the compiler had optimized them out, the runtime would have been much shorter. – cmaster - reinstate monica Jun 02 '14 at 18:32
  • Did you take a look at the assembler code *after* you moved the loops into the functions? – cmaster - reinstate monica Jun 02 '14 at 18:33
  • @cmaster I did but I'm not experienced enough to make any real judgements about it. Here's the assembly for [`bitwisemultiplyloop`](http://i.stack.imgur.com/DOnfj.png) and [`regularmultiplyloop`](http://i.stack.imgur.com/EnjBz.png). If you have anything to add, I can make it a community wiki. – Chris Middleton Jun 02 '14 at 18:41
  • The assembler code in those two images looks like you compiled with `-O0`. Consequently, the bitwise multiply does more stack accesses as it manipulates more intermediate values. And that makes it slow, even though these stack accesses are completely unnecessary. If you compile with `-O2` or `-Os`, the picture should change drastically. – cmaster - reinstate monica Jun 02 '14 at 18:58
  • I didn't compile with any special arguments -- is `-O0` a low optimization setting? I thought high optimization was the default behavior. – Chris Middleton Jun 02 '14 at 19:02
  • No, default is no optimization, which is the same as `-O0`. – cmaster - reinstate monica Jun 02 '14 at 19:10
  • I see; thanks for the info. I think the `-O2` and `-Os` settings optimized away the loop completely (or at least, it seems that way) since I was only printing the final multiplication. – Chris Middleton Jun 02 '14 at 19:21

Indeed, it depends on a variety of factors, so I checked by running both versions and measuring the time. The line we are interested in takes only a few CPU instructions, which execute very quickly, so I wrapped it in a loop to multiply the execution time of a single iteration by a large number. I found that k *= 2.25; is about 1.5 times slower than k = (k<<1) + (k/4);. Here are the two programs I compared:

prog1:

#include <iostream>
using namespace std;

int main()
{
    int k = 5;
    for (unsigned long i = 0; i <= 0x2fffffff; i++)
        k = (k << 1) + (k / 4);
    cout << k << endl;

    return 0;
}

prog2:

#include <iostream>
using namespace std;

int main()
{
    int k = 5;
    for (unsigned long i = 0; i <= 0x2fffffff; i++)
        k *= 2.25;
    cout << k << endl;

    return 0;
}

Prog1 takes 8 seconds and Prog2 takes 14 seconds. By running this test on your own architecture and with your own compiler, you can get the result that is correct for your particular environment.
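If you'd rather take the timing inside the program than with an external stopwatch, std::chrono works; a minimal sketch of timing the first loop (printing k keeps the compiler from discarding the loop as dead code):

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();

    int k = 5;
    for (unsigned long i = 0; i <= 0x2fffffff; i++)
        k = (k << 1) + (k / 4);

    auto stop = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

    // Print k so the loop has an observable result.
    std::cout << k << " (took " << ms.count() << " ms)" << std::endl;
    return 0;
}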

Ruslan Gerasimov

That depends heavily on the CPU architecture: Floating point arithmetic, including multiplications, has become quite cheap on many CPUs. But the necessary float->int conversion can bite you: on POWER-CPUs, for instance, the regular multiplication will crawl along due to the pipeline flushes that are generated when a value is moved from the floating point unit to the integer unit.

On some CPUs (including mine, which is an AMD CPU), this version is actually the fastest:

k *= 9;
k >>= 2;

because these CPUs can do a 64 bit integer multiplication in a single cycle. Other CPUs are definitely slower with my version than with your bitshift version, because their integer multiplication is not as heavily optimized. Most CPUs aren't as bad on multiplications as they used to be, but a multiplication can still take more than four cycles.

So, if you know which CPU your program will run on, measure which is fastest. If you don't know, your bitshift version won't perform badly on any architecture (unlike both the regular version and mine), which makes it a really safe bet.
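For non-negative k, and leaving aside overflow (which k * 9 hits first), all three variants compute the same value; a quick sanity check, as a sketch:

#include <cassert>

int main()
{
    // 2.25*k == 9*k/4; keep the range modest so k*9 cannot overflow.
    for (int k = 0; k < 100000; ++k) {
        int a = k * 2.25;            // double multiply, truncated back to int
        int b = (k << 1) + (k >> 2); // 2k + floor(k/4)
        int c = (k * 9) >> 2;        // floor(9k/4)
        assert(a == b && b == c);
    }
    return 0;
}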

cmaster - reinstate monica

It depends highly on what hardware you are using. On modern hardware, floating point multiplications may run way faster than integer ones, so if you're writing for such hardware and have a lot of operations like multiplying by 2.25, you might want to change the algorithm and use doubles instead of integers, if nothing else prevents you from doing that.

And be data-driven: measure performance, because it's affected by the compiler, the hardware, and the way you implement your algorithm.
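If that advice fits your situation, the change is small. A minimal sketch, assuming a single truncating conversion back to int at the end is acceptable:

#include <iostream>

int main()
{
    int k = 5;
    double v = k;  // carry the working value as a double
    v *= 2.25;     // floating-point multiply is cheap on modern CPUs
    // Convert back to an integer only once, at the point an int is required.
    std::cout << static_cast<int>(v) << std::endl; // prints 11
    return 0;
}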

Alex
  • *"On modern hardware floating point multiplications may run way faster than integer ones"*... really? Do you have a reference for this? I was never aware of this... – user541686 Jun 02 '14 at 07:16
  • Sure. Just made measurements on my Mac Book: https://gist.github.com/avshabanov/b4960e95c8b68575ad27 Note: For fair comparison I compared double multiplication vs long long multiplication+shift right to emulate non-integer multiplications with integer primitive types. – Alex Jun 02 '14 at 17:35
  • @Mehrdad It's certainly possible if you consider auto-vectorization. At the extreme, a Haswell processor can do 8 DP-MUL/cycle via dual-issue AVX multiply. But Haswell can only do one 64-bit integer multiply each cycle. There is no SIMD for 64-bit integer multiply. – Mysticial Jun 02 '14 at 18:47
  • @Mehrdad Yes, he is right that floating point multiplications can be faster than integer multiplications. Probably has to do with the fact that floating point multiplications are essential to get high machoflop... pardon, gigaflop numbers. – cmaster - reinstate monica Jun 02 '14 at 18:49
  • @Mehrdad A 64-bit integer multiplication has to compute 128 bits of results. A double-precision multiplication only needs to compute a 53-bit significand. The influence of the exponent parts over the result is trivial to compute. – Pascal Cuoq Jun 02 '14 at 18:50
  • @PascalCuoq Sorry to contradict, but you can completely forget about those high order bits, you only need to calculate the 64 bits of your result. – cmaster - reinstate monica Jun 02 '14 at 18:52
  • @cmaster To get the properly rounded 64-bit result, are not all 128 bits needed? – chux - Reinstate Monica Jun 02 '14 at 19:38
  • @cmaster The traditional IA-32 and x86-64 integer multiplication instructions that I know of taking N-bit operands must compute a 2N-bit result, because that's what the spec says they do: http://faydoc.tripod.com/cpu/imul.htm . I am not up-to-date on newer instructions though, what instructions are you referring to? – Pascal Cuoq Jun 02 '14 at 19:47
  • @chux No. If you do a, say, five digit multiplication by hand, you will see that you never use information from left of the fifth digit to calculate a digit on the right. Lower order influences higher order, but not the other way round. And to the CPU, there is not even a difference between a signed and an unsigned multiplication. So it really can just cut off at that point. – cmaster - reinstate monica Jun 02 '14 at 19:47
  • @cmaster Of course you are correct. I was thinking of a `double` significand needing extended product bits. – chux - Reinstate Monica Jun 02 '14 at 19:51
  • @PascalCuoq I can't find the requirement of having to calculate a 2N-bit result in the link you gave. But I was not talking about a concrete implementation, I was talking about the mechanics of a fixed size integer multiplication. And, as it happens, you really don't need more significant bits than the resulting integer contains. Any spec that talks about calculating a 2N-bit intermediate result will only do so for clarity of presentation. And if actual hardware does calculate those bits, it will only do so to support an instruction to compute the high order bits instead of the low order ones. – cmaster - reinstate monica Jun 02 '14 at 20:03
  • @cmaster Sorry, it is not expressed more clearly than in “and the product is stored in the AX, DX:AX, or EDX:EAX registers, respectively”. And I **was** talking about the concrete implementation in the processor on your desktop, that actually **has** to compute the full 128-bit product when you call the `IMUL` instruction in order to store it in the DX and AX registers, not about a theoretical fixed-size integer instruction that I do not know to exist in the widespread x86-64 instruction set (but perhaps it does. You just haven't given me its mnemonic). – Pascal Cuoq Jun 02 '14 at 20:09
  • @Mysticial: I was asking about a *single* integer multiplication compared to a *single* floating-point multiplication. It's pretty obvious that if you have more multiplication units in the CPU of one kind then you'll get more throughput for that kind (which is not a terribly insightful observation), but that's not what my question was. I was asking about how long it takes to do a *single* multiply. – user541686 Jun 02 '14 at 21:12
  • @Alex: I'm not sure how convincing your benchmark is. Your array is twice as big in the second scenario than the first, so there are necessarily more memory accesses there. (Sure, it hits the cache, but it still adds delay.) I'm not sure if you're correct or not but your benchmark certainly doesn't seem to prove it... – user541686 Jun 02 '14 at 21:15
  • @cmaster: It's good to know he's right, but I'm looking for something more convincing than a mere assertion =P so far his benchmark doesn't look all that convincing. – user541686 Jun 02 '14 at 21:16
  • @PascalCuoq Ah, I see, so it's actually a multiplication with two 64 bit inputs and one 128 bit result. Nice to know. On other architectures, there is no such instruction. The PowerPC, for instance, had only a multiplication with two 32 bit inputs and one 32 bit output. Consequently, it didn't have to calculate the 32 high order bits. But thanks for bringing me up to speed on x86. – cmaster - reinstate monica Jun 04 '14 at 18:41