
It is well known that shifting bits to the left is faster than multiplying, because barrel shifters are implemented directly in hardware. Therefore, this simple benchmark should be wrong:

$start = 1;

// Time 10 million left shifts.
$timestart = microtime(1);
for ($i = 0; $i < 10000000; $i++) {
    $result2 = $start << 2;
}
echo microtime(1) - $timestart;

// Time 10 million multiplications.
$timestart = microtime(1);
for ($i = 0; $i < 10000000; $i++) {
    $result1 = $start * 4;
}
echo microtime(1) - $timestart;
echo "\n";

I say that because I have executed it multiple times, and multiplying was always faster than shifting bits to the left. For example:

0.73733711242676

0.71091389656067

Therefore, either the benchmark is wrong or the PHP interpreter is doing something here. The test was executed with PHP 7.0.32 running on Ubuntu:

PHP 7.0.32-0ubuntu0.16.04.1 (cli) ( NTS )

CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz

Edit:

Executing it on a Windows box with almost the same CPU (Intel(R) Core(TM) i5-4460S CPU @ 2.90GHz), the results are as expected:

0.24960112571716

0.28080010414124

The PHP version for this case is different:

PHP 7.1.19 (cli) (built: Jun 20 2018 23:24:42) ( ZTS MSVC14 (Visual C++ 2015) x64 )

  • If you reverse the order of the benchmarks, do you get the opposite result? If so, it's probably because your CPU doesn't ramp up to max clock speed right away or other startup overhead factors. Also, you have an `echo` inside the 2nd timed interval (which if anything you'd expect to skew the opposite direction). – Peter Cordes Nov 11 '18 at 13:25
  • In which setting would you have to left shift a hundred thousand times in your typical web application? Those benchmarks are mostly measuring the loop performance / comparison / counting rather than the instruction. And even then it's negligible in comparison to the zval overhead to be meaningful. – mario Nov 11 '18 at 13:29
  • @PeterCordes I tried it before asking: Reversing the order gets the same results: multiplying is faster. You are right about the "echo", but it is inside both intervals. Anyway, even removing both, multiplying is still faster. – Víctor Iglesias Castán Nov 11 '18 at 13:30 (a reworked benchmark sketch addressing these points follows the comments)
  • You didn't update the numbers after changing the code, so now your numbers aren't from the version of the code in the question. – Peter Cordes Nov 11 '18 at 13:54
  • @PeterCordes Updated, thank you. – Víctor Iglesias Castán Nov 11 '18 at 13:57
  • Whoa, the times changed by that much when you re-ran it? That's not very consistent at all. You should be doing many more iterations. – Peter Cordes Nov 11 '18 at 14:03
  • @PeterCordes The benchmark now runs 10^2 times more operations. – Víctor Iglesias Castán Nov 11 '18 at 14:50
  • You say *the results are as expected* for the PHP 7.1 test, but based on low-level asm / CPU microarchitecture reasoning, the shift version *is* the one that should be slower on an Intel CPU, because variable-count shifts cost extra uops and we should be throughput-bottlenecked, not latency bottlenecked. Integer multiply is not slow on modern CPUs, really only integer division still is. The massive difference in runtime for 2 nearly equal-frequency CPUs should make it obvious that it's not the multiply or shift itself that's slow or fast, but rather PHP itself. – Peter Cordes Nov 12 '18 at 15:53
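
To make the comparison easier to trust, a reworked benchmark along the lines suggested in the comments might look like this. It is only a sketch, not code from the original thread: it adds a warm-up pass so the CPU can reach its maximum clock first, runs each loop twice in alternating order, and keeps echo out of the timed regions.

$start = 1;
$iters = 10000000;

// Warm-up pass: give the CPU time to ramp up to its maximum clock speed.
for ($i = 0; $i < $iters; $i++) { $warm = $start << 2; }

$times = [];
foreach (['shift', 'mul', 'mul', 'shift'] as $which) {  // run both orders
    $t = microtime(true);
    if ($which === 'shift') {
        for ($i = 0; $i < $iters; $i++) { $result = $start << 2; }
    } else {
        for ($i = 0; $i < $iters; $i++) { $result = $start * 4; }
    }
    $times[] = [$which, microtime(true) - $t];
}

// Output only after all timing is done, so echo never sits inside a timed region.
foreach ($times as list($which, $seconds)) {
    printf("%-5s %.6f s\n", $which, $seconds);
}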

1 Answer


Your reasoning about hardware is basically irrelevant. You're using an interpreted language where most of the cost is interpreter overhead.
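
As a rough illustration of that claim (a sketch, not from the original answer), time the same loop with an empty body; most of the measured time will still be there, because the per-iteration interpreter work (opcode dispatch, the counter increment, and the comparison) is what dominates:

$timestart = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    // empty body: only the loop counter, comparison, and opcode dispatch remain
}
echo microtime(true) - $timestart, "\n";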

An asm version of either loop could run at 1 iteration per clock (assuming a fixed-count shift), so the 10 million iterations would take (on a 3GHz CPU) about 3.3 ms, or 0.0033 seconds, over 200 times faster than your PHP times.
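
A quick back-of-the-envelope check of that estimate, using numbers taken from the question (the 0.737 s figure is the question's measured time; 3 GHz is the round figure used above):

$iters   = 10000000;          // iterations in the question's loops
$clk_hz  = 3.0e9;             // ~3 GHz, the round figure used above
$asm_sec = $iters / $clk_hz;  // ~0.0033 s at 1 iteration per clock
$php_sec = 0.737;             // measured PHP time from the question
printf("asm estimate: %.4f s  (PHP is ~%.0fx slower)\n", $asm_sec, $php_sec / $asm_sec);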


Also, an interpreted loop has to use a variable-count shift (because it can't JIT-compile the shift count into an immediate in the machine code), which is actually more expensive for throughput (3 uops) on Intel CPUs because of x86 legacy baggage (flag semantics): shl reg, cl vs. shl reg, imm8. AMD CPUs have single-uop shifts even for variable shift counts. See INC instruction vs ADD 1: Does it matter? for more about why shl reg, cl is 3 uops on Sandybridge-family, and how it could create a false dependency through flags.

Integer multiply is 1 uop, 1-per-clock throughput, 3-cycle latency, on Intel Sandybridge-family and AMD Ryzen. It's 1 per 2 clocks on AMD Bulldozer-family, not fully pipelined. So yes, multiply has higher latency, but on Sandybridge-family and Ryzen both operations are fully pipelined for throughput. Your loop throws away the result, so there's no loop-carried dependency chain; latency is irrelevant (and hidden by out-of-order execution).
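
For contrast, here is a sketch (not from the original answer) of a loop where latency would matter in compiled code, because each iteration feeds its result into the next one. In PHP you still shouldn't expect to see the 3-cycle multiply latency, since interpreter overhead dominates either way:

$x = 1;
$timestart = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    // Each multiply depends on the previous result (a loop-carried dependency).
    // The mask just keeps $x a small integer instead of overflowing to float.
    $x = ($x * 4 + 1) & 0xFFFF;
}
echo microtime(true) - $timestart, "\n";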

But that minor difference (2 extra uops) is not enough to account for the measured difference. The actual shift or multiply is only about 1/200th of the total cycles the loop takes. You say switching the order of the loops doesn't change the result, so it's not just a warm-up effect before your CPU ramps up to max clock speed.

Your i5-4460 is a Haswell (a Sandybridge-family microarchitecture), but the answer probably doesn't depend on how shift vs. multiply instructions decode anyway.

Peter Cordes