
Which one is faster -

val = val*10;

or

val = (val<<3) + (val<<2);

How many clock cycles does imul take compared to a shift instruction?

Cody Gray - on strike
Kartlee
  • Similar questions have been asked before -- the end result: "look at the instruction documentation provided by the manufacturer" (this will vary per model) and "it's complicated" (modern super-scalar out-of-order pipelined CPUs can't just be tallied up like a "traditional" microprocessor). –  May 25 '11 at 06:11
  • At the very least, include the target CPU *model*. This *may* actually lead to something somewhat interesting. –  May 25 '11 at 06:13
  • All the more reason to let the compiler worry about it: writing it the standard way is more readable (for humans), and there is a better chance the compiler will do the right thing for the architecture at hand. – Henno Brandsma May 25 '11 at 06:14
  • I can't say this enough: Measure it! And look up your CPU docs. – Michael Foukarakis May 25 '11 at 06:18
  • You are writing in C but talking about asm. Compile with optimization enabled, and you will see that compilers do this for you. – Peter Cordes Aug 19 '17 at 16:14

4 Answers


This is the 21st century. Modern hardware and compilers know how to produce highly optimised code. Writing the multiplication using shifts won't help performance, but it will help you produce code with bugs in it.

You have demonstrated this yourself with code that multiplies by 12 rather than 10.
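
A sketch not in the original answer, just to spell out the arithmetic behind that remark:

/* (val<<3) + (val<<2) is val*8 + val*4, i.e. val*12; val*10 needs (val<<3) + (val<<1). */
#include <assert.h>

int main(void)
{
    int val = 7;
    assert((val << 3) + (val << 2) == val * 12);  /* the question's shifts */
    assert((val << 3) + (val << 1) == val * 10);  /* what *10 actually needs */
    return 0;
}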

David Heffernan
  • +1 for "You have demonstrated this yourself with code that multiplies by 12 rather than 10". A most excellent answer! – Ben May 25 '11 at 09:36

I'd say, just write val = val * 10; or val *= 10;, and let the compiler worry about such questions.
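
If you want to see what "letting the compiler worry" actually produces, one way (a sketch assuming gcc; the exact output varies by compiler and target) is to ask it for the assembly:

/* times10.c -- build with `gcc -O2 -S times10.c` and read times10.s to see
 * the strength reduction the compiler applies on its own. */
int times10(int val)
{
    return val * 10;
}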

Henno Brandsma
  • +1 "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" – ThiefMaster May 25 '11 at 06:12
  • Especially considering that this optimization can be done easily by the compiler if wanted (Warren has a whole chapter about that stuff in Hacker's Delight). – Voo May 25 '11 at 11:45

In this case they probably take the same number of cycles, though your manual "optimization" needs one more register (which can slow down the surrounding code):

val = val * 10;
lea    (%eax,%eax,4),%eax
add    %eax,%eax

vs

val = (val<<3) + (val<<1);
lea    (%eax,%eax,1),%edx
lea    (%edx,%eax,8),%eax

The compiler knows how to do strength reduction, and probably much better than you. Also, when you port your code to another platform (say, ARM), the compiler knows how to do strength reduction on that platform too (x86's LEA provides different optimization opportunities than ARM's ADD and RSB).
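
As an illustration only (this assembly is an assumption, not taken from the answer; the exact sequence depends on the compiler and -mcpu/-march settings), a compiler for 32-bit ARM typically reaches val * 10 with a shifted-operand ADD instead of LEA:

/* Plausible gcc -O2 output for AArch32:
 *   add r0, r0, r0, lsl #2   @ r0 = val + val*4 = val*5
 *   lsl r0, r0, #1           @ r0 = val*5 * 2   = val*10
 */
int times10(int val)
{
    return val * 10;
}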

ninjalj

Doing silly "optimizations" like this by hand in a high-level language will accomplish nothing but showing people you're out of touch with modern technology and programming practices.

If you were writing in assembly directly, it would make sense to worry about this, but you're not.

With that said, there are a few cases where the compiler won't be able to optimize something like this. Consider an array of possible multiplicative factors, each consisting of exactly 2 nonzero bits, with code like:

x *= a[i];

If profiling shows this to be a major bottleneck in your program, you might consider replacing that by:

x = (x<<s1[i]) + (x<<s2[i]);

as long as you plan to measure the results. However, I suspect it's rare to find a situation where this would help, or where it would even be possible. It's only plausible on a CPU whose multiply unit is weak compared to its shifts and total instruction throughput.
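
A sketch of how the shift tables might be set up (the names s1, s2 and build_shift_tables are hypothetical, not from the answer; it assumes every a[i] really does have exactly two set bits and uses GCC/Clang bit-scan builtins):

#include <stddef.h>
#include <stdint.h>

/* Precompute the two bit positions of each factor once, outside the hot loop. */
static void build_shift_tables(const uint64_t *a, uint8_t *s1, uint8_t *s2, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        s1[i] = (uint8_t)__builtin_ctzll(a[i]);         /* position of the low set bit  */
        s2[i] = (uint8_t)(63 - __builtin_clzll(a[i]));  /* position of the high set bit */
    }
}

/* Hot loop: x *= a[i] then becomes x = (x << s1[i]) + (x << s2[i]); */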

Peter Cordes
R.. GitHub STOP HELPING ICE
  • Your two-shifts idea could help on a CPU with very slow multiplies but fast variable-count shifts, especially if `x` is `__int128`, or even `int64_t` on a 32-bit CPU. (Although doing two extended-precision shifts is not wonderful either.) Intel's original Pentium (P5) had `imul` that takes 9c. Unfortunately `shl r32, cl` takes 4c, so this idea is a non-starter on P5. In-order Atom (pre-Silvermont) has 13c latency (and not pipelined) for `imul r64, r64`, but 1c latency for `shl r64, cl`, so this could win there. – Peter Cordes Aug 19 '17 at 16:08
  • Especially if you pack the two shift counts into a 16-bit struct, so instead of two separate loads, it's `movzx ecx, word s[i]` / `shl rax, cl` / `shr ecx, 8` / ... / `shl rbx, cl` / `add rax, rbx`. (I think you want `+` rather than `|` to combine the two partial-multiply results). – Peter Cordes Aug 19 '17 at 16:09
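
A sketch of the packed layout that comment describes (the names are hypothetical): both shift counts for one factor fit in a single 16-bit load, so the hot loop touches one table instead of two.

#include <stdint.h>

struct shift_pair {          /* 2 bytes: one small load per factor */
    uint8_t lo;              /* smaller shift count */
    uint8_t hi;              /* larger shift count  */
};

static inline uint64_t mul_by_two_bit_factor(uint64_t x, struct shift_pair s)
{
    return (x << s.lo) + (x << s.hi);   /* '+' combines the two partial products */
}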