
What is generally the faster solution, multiplying or bit shifting?

If I want to multiply by 10000, which code would be faster?

v = (v<<13) + (v<<11) + (v<<4) - (v<<8);

or

v = 10000*v;

And the second part of the question: how do I find the lowest number of shifts required to do some multiplication? (I'm interested in multiplying by 10000, 1000 and 100.)

Kelu Thatsall
  • The answer (to the first part of the question) depends on the processor. Why not leave that decision to the compiler, which knows the target architecture? – user4815162342 Feb 27 '14 at 12:39
  • Will the compiler always do that in the fastest way? – Kelu Thatsall Feb 27 '14 at 12:40
  • He is not saying that he is optimizing code, try and get your limited mind around the fact that some people are simply curious and want to learn, you know, for the sake of learning. – DusteD Feb 27 '14 at 12:43
  • @user4815162342 is right. it depends on the machine. for yours why not do a benchmark test. – Guy Feb 27 '14 at 12:45
  • @KeluThatsall yes, the compiler will almost always do that in the fastest way (assuming it is a competent compiler, though). – Guy Feb 27 '14 at 12:47
  • I suppose I had that coming... no offense meant. I perfectly understand wanting to learn. But "faster" is a freaking rabbit hole; it does not permit a simple and satisfactory answer, only "it depends" and "in specific circumstances X, Y, Z under assumptions A and B, [yes/no]". The only workable approach is to (1) understand as many of the things involved as possible (that means *everything*, from theoretical CS down to machine code and logic gates) and their interactions, and then (2) use logic and the scientific method to derive answers for specific circumstances. You're asking the wrong questions. –  Feb 27 '14 at 12:50
  • possible duplicate of [Is multiplication and division using shift operators in C actually faster?](http://stackoverflow.com/questions/6357038/is-multiplication-and-division-using-shift-operators-in-c-actually-faster) – Tony Delroy Feb 27 '14 at 13:20
  • @DusteD You are being offensive for no good reason. In the context of portable C programming, and without specifying the target architecture, the question cannot be answered as posed. Also, the OP was asking which version is "faster" at the obvious expense of clarity, which is by definition optimizing code. – user4815162342 Feb 27 '14 at 17:08

4 Answers


It really depends on the architecture of the processor, as well as the compiler that you're using.

But you can simply view the disassembly of each option and see for yourself.

Here is what I got using the Visual Studio 2010 compiler for Pentium:

int v2 = (v<<13) + (v<<11) + (v<<4) - (v<<8);
mov         eax,dword ptr [v]   ; eax = v
shl         eax,0Dh             ; eax = v << 13
mov         ecx,dword ptr [v]   ; ecx = v
shl         ecx,0Bh             ; ecx = v << 11
add         eax,ecx             ; eax = (v<<13) + (v<<11)
mov         edx,dword ptr [v]   ; edx = v
shl         edx,4               ; edx = v << 4
add         eax,edx             ; eax = (v<<13) + (v<<11) + (v<<4)
mov         ecx,dword ptr [v]   ; ecx = v
shl         ecx,8               ; ecx = v << 8
sub         eax,ecx             ; eax = (v<<13) + (v<<11) + (v<<4) - (v<<8)
mov         dword ptr [v2],eax  ; store the result in v2

int v2 = 10000*v;
mov         eax,dword ptr [v]   ; eax = v
imul        eax,eax,2710h       ; eax = v * 10000 (2710h = 10000)
mov         dword ptr [v2],eax  ; store the result in v2

So it appears that the second option is faster in my case.

BTW, you might get a different result if you enable optimization (mine was disabled)...
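
If you want to repeat the experiment with optimization turned on, a small test file like the sketch below (the file and function names are just placeholders, not from the original post) can be compiled straight to assembly, e.g. with gcc -O2 -S mul.c, or with cl's /O2 /FAs switches, and the code generated for the two functions compared:

/* mul.c - minimal sketch for inspecting what the compiler emits
   for each variant once optimization is enabled. */
int mul_by_constant(int v)
{
    return 10000 * v;                                     /* plain multiplication */
}

int mul_by_shifts(int v)
{
    return (v << 13) + (v << 11) + (v << 4) - (v << 8);   /* hand-written shifts */
}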

barak manos

To the first question: Don't bother. The compiler knows better and will optimize it according to the respective target hardware.

To the second question: Look at the binary representation:

For example: bin(10000) = 0b10011100010000:

 1  0  0  1  1  1  0  0  0  1  0  0  0  0
13 12 11 10  9  8  7  6  5  4  3  2  1  0

So you have to shift by 13, 10, 9, 8 and 4. If you want to shortcut consecutive ones (by subtracting as in your question) you need at least three consecutive ones in order to gain anything.
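
If you want to compute those positions programmatically (for the 10000, 1000 and 100 from the question), a small C sketch like the following does it; the helper name is made up:

#include <stdio.h>

/* Print the set bit positions of m, i.e. the left-shift amounts whose
   shifted values sum to a multiplication by m. */
static void print_shifts(unsigned m)
{
    printf("%u:", m);
    for (int bit = 31; bit >= 0; --bit)
        if (m & (1u << bit))
            printf(" %d", bit);
    printf("\n");
}

int main(void)
{
    print_shifts(10000);  /* prints: 10000: 13 10 9 8 4  -> v*10000 = (v<<13)+(v<<10)+(v<<9)+(v<<8)+(v<<4) */
    print_shifts(1000);   /* prints: 1000: 9 8 7 6 5 3 */
    print_shifts(100);    /* prints: 100: 6 5 2 */
    return 0;
}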

But again, let the compiler do this. That's its job.

bitmask
  • Thanks, I wondered if there is some algorithm to find those. But if you say the compiler knows best then I'll just leave it to him ;) – Kelu Thatsall Feb 27 '14 at 12:46
  • @KeluThatsall: Regarding an algorithm see the edit. It's a lot more complex than what you can possibly gain. – bitmask Feb 27 '14 at 12:50

There is only one situation in which shift operations are faster than *, and it is defined by two conditions:

  1. the operation involves a value that is a power of two
  2. you are multiplying by a fractional number -> i.e. really a division.

Let's look a little deeper:

  • multiplication/division and shift operations are done by dedicated units in the HW architecture; usually you have shifters and multipliers/dividers to perform these operations, but each operation is performed by a different part of the Arithmetic Logic Unit.
  • multiplication/division with a power of two is equivalent to a left_shift/right_shift operation
  • if you are not dealing with a power of 2, then multiplication and division are performed slightly differently:

    • Multiplication is performed by the HW (the ALU) in a single instruction (depending on the data type, but let's not overcomplicate things)
    • Division is performed as a loop of consecutive subtractions -> more than one instruction

Summarizing:

  1. multiplication is only one instruction, while replacing it with a series of shift operations takes multiple instructions -> the first option is faster (even on a parallel architecture)

  2. multiplication by a power of two is the same as a shift operation; the compiler usually generates a shift when it detects this in the code.

  3. division takes multiple instructions; replacing it with a series of shifts might prove faster, but it depends on the situation.

  4. division by a power of two takes multiple operations and can be replaced with a single right_shift operation; a smart compiler will do this automatically (see the sketch below for points 2 and 4)
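
A minimal illustration of points 2 and 4 (the function names are made up, and the exact instructions depend on compiler and target, but with optimization enabled most compilers emit a single shift for the unsigned case):

/* With optimization enabled, most compilers reduce both of these to a
   single shift (e.g. shl/shr on x86); signed division by a power of two
   needs a couple of extra fix-up instructions to round toward zero. */
unsigned mul8(unsigned x) { return x * 8u; }   /* typically compiled to x << 3 */
unsigned div8(unsigned x) { return x / 8u; }   /* typically compiled to x >> 3 */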

Pandrei
  • The logic of summary point #1 is faulty. Multiple instructions are *not* necessarily slower than a single instruction that does something completely different. Different instructions can have radically different latency, throughput, and pipeline behavior. As an extreme example, consider register moves (virtually free) and reading from RAM with a cache miss (hundreds of cycles of latency, and probably lower throughput too). –  Feb 27 '14 at 15:30
  • I did not want to overcomplicate things and get into a lot of architecture details BUT you can safely assume that the same premises are true for both cases; i.e. you can suffer from pipeline stalls, cache misses, high register pressure no matter the number of instructions you execute. But given the same premises (comparing apples to apples), one instruction executes faster than multiple instructions no matter what the architecture. – Pandrei Feb 27 '14 at 15:57

An older Microsoft C compiler optimized the sequence using lea (load effective address), which allows multiplying by 5 in a single instruction (reg + reg*4):

        lea     eax, DWORD PTR [eax+eax*4]  ;eax = v*5
        lea     ecx, DWORD PTR [eax+eax*4]  ;ecx = v*25
        lea     edx, DWORD PTR [ecx+ecx*4]  ;edx = v*125
        lea     eax, DWORD PTR [edx+edx*4]  ;eax = v*625
        shl     eax, 4                      ;eax = v*10000
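
In C terms, the lea chain above corresponds to repeated multiplication by 5 followed by a shift (this restatement and the function name are mine, not part of the compiler output):

/* Each lea reg,[r+r*4] multiplies by 5; the final shl multiplies by 16. */
int mul10000_lea_style(int v)
{
    int t = v * 5;      /* v*5     */
    t = t * 5;          /* v*25    */
    t = t * 5;          /* v*125   */
    t = t * 5;          /* v*625   */
    return t << 4;      /* v*10000 */
}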

A multiply (signed or unsigned) was still faster on my system (Intel 2600K, 3.4 GHz). Visual Studio 2005 and 2012 effectively computed v*10256 using shifts and then subtracted (v<<8); that shift and add/subtract sequence was slower than the lea method above:

        shl     eax,4      ;eax = v*16
        mov     ecx,eax    ;ecx = v*16
        shl     eax,4      ;eax = v*256
        sub     ecx,eax    ;ecx = v*(16-256)
        shl     eax,3      ;eax = v*2048
        add     ecx,eax    ;ecx = v*(16-256+2048)
        shl     eax,2      ;eax = v*8192
        add     eax,ecx    ;eax = v*(16-256+2048+8192) = v*10000
rcgldr