This attempt at microbenchmarking is too naive in almost every way possible for you to get any meaningful results.

Even if you fixed the surface problems (so the code didn't optimize away), there are major deep problems before you can conclude anything about when your `asm` would be better than `*`.

(Hint: probably never. Compilers already know how to optimally multiply integers, and understand the semantics of that operation. Forcing it to use `imul` instead of auto-vectorizing or doing other optimizations is going to be a loss.)
Both timed regions are empty because both multiplies can optimize away. (The `asm` is not `asm volatile`, and you don't use the result.) You're only measuring noise and/or CPU frequency ramp-up to max turbo before the `clock()` overhead.
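As a minimal sketch of the two fixes (using a hypothetical `asm_multiply` wrapper standing in for the one from the question): either consume the result through something the compiler can't delete, or make the `asm` statement itself `volatile`:

```c
// Sketch: two ways to keep the multiply from being optimized away.
static inline int asm_multiply(int a, int b) {
    asm("imull %1, %0" : "+r"(a) : "r"(b));  // non-volatile: deleted if unused
    return a;
}

volatile int sink;                  // stores to a volatile object must happen

void fix1(void) {
    sink = asm_multiply(4, 5);      // use the result, so the asm must run
}

void fix2(int a, int b) {
    asm volatile("imull %1, %0"     // asm volatile is emitted even if its
                 : "+r"(a)          // output is never read
                 : "r"(b));
}
```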
And even if they weren't, a single `imul` instruction is basically unmeasurable with a function with as much overhead as `clock()`. Maybe if you serialized with `lfence` to force the CPU to wait for `imul` to retire, before `rdtsc`... See *RDTSCP in NASM always returns the same value*.
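Something like this sketch (using the `__rdtsc` and `_mm_lfence` intrinsics) is roughly the minimum for timing a single instruction, and even then you're mostly measuring the fences and `rdtsc` itself:

```c
#include <stdint.h>
#include <x86intrin.h>              // __rdtsc, _mm_lfence

// Sketch: serialize with lfence so the imul must retire between the
// two rdtsc reads. Mostly measures fence + rdtsc overhead anyway.
static inline uint64_t time_one_imul(int a, int b) {
    _mm_lfence();                   // wait for earlier instructions to finish
    uint64_t start = __rdtsc();
    _mm_lfence();                   // don't let imul start before rdtsc samples the TSC
    asm volatile("imull %1, %0" : "+r"(a) : "r"(b));
    _mm_lfence();                   // wait for imul to retire before sampling again
    uint64_t stop = __rdtsc();
    return stop - start;
}
```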
Or you compiled with optimization disabled, which is pointless.
You basically can't measure a C `*` operator vs. inline asm without some kind of context involving a loop. And then the result will be for that context, dependent on what optimizations you defeated by using inline asm (and on what, if anything, you did to stop the compiler from optimizing away the work in the pure C version).
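As a sketch of what "context involving a loop" means: give each version real work that can't be removed. The empty `asm volatile` with a `"+r"` operand forces the value through a register every iteration without adding any instructions:

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    int x = 3;
    clock_t start = clock();
    for (long i = 0; i < 1000000000; i++) {
        x *= 5;                       // or the inline-asm version, for comparison
        asm volatile("" : "+r"(x));   // opaque to the optimizer: the multiply
                                      // can't be folded, hoisted, or vectorized
    }
    clock_t stop = clock();
    printf("x=%d, %f s\n", x, (double)(stop - start) / CLOCKS_PER_SEC);
    return 0;
}
```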
Measuring only one number for a single x86 instruction doesn't tell you much about it. You need to measure latency, throughput, and front-end uop cost to properly characterize its cost. Modern x86 CPUs are superscalar, out-of-order, and pipelined, so the sum of costs for 2 instructions depends on whether they're dependent on each other, and on other surrounding context. See *How many CPU cycles are needed for each assembly instruction?*
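For example (a sketch; a runtime multiplier keeps the compiler from strength-reducing to LEA), latency and throughput need differently shaped loops:

```c
// Latency: one serial dependency chain; each multiply waits for the last.
long mul_latency(long x, long m, long n) {
    for (long i = 0; i < n; i++) {
        x *= m;                        // loop-carried dependency through x
        asm volatile("" : "+r"(x));    // keep the chain strictly serial
    }
    return x;
}

// Throughput: several independent chains the out-of-order core can overlap.
long mul_throughput(long a, long b, long c, long d, long m, long n) {
    for (long i = 0; i < n; i++) {
        a *= m; b *= m; c *= m; d *= m;  // no dependencies between chains
    }
    return a + b + c + d;
}
```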
The stand-alone definitions of the functions are identical, after your change to let the compiler pick registers, and your asm could inline somewhat efficiently, but it's still optimization-defeating. gcc knows that 5 * 4 = 20 at compile time, so if you did use the result, `multiply(4,5)` could optimize to an immediate `20`. But gcc doesn't know what the asm does, so it just has to feed it the inputs at least once. (Non-`volatile` means it can CSE the result if you used `asmMultiply(4,5)` in a loop, though.)
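To see that CSE point concretely, a sketch (the body of `asmMultiply` here is a guess at the question's wrapper):

```c
// Because the asm statement isn't volatile, gcc treats it as a pure
// function of its inputs and can hoist it out of the loop entirely.
static inline int asmMultiply(int a, int b) {
    asm("imull %1, %0" : "+r"(a) : "r"(b));
    return a;
}

int sum(void) {
    int total = 0;
    for (int i = 0; i < 100; i++)
        total += asmMultiply(4, 5);  // asm can run once; the loop just adds
    return total;
}
```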
So among other things, inline asm defeats constant propagation. This matters even if only one of the inputs is a constant and the other is a runtime variable: many small integer multipliers can be implemented with one or two LEA instructions or a shift (with lower latency than the 3c for `imul` on modern x86).
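For example, gcc and clang compile a multiply by a small constant to a single `lea` rather than an `imul`:

```c
int mul5(int x) { return x * 5; }
// gcc/clang -O2 for x86-64:
//     lea  eax, [rdi + 4*rdi]    ; 5*x = x + 4*x: one uop, 1-cycle latency
//     ret
```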
https://gcc.gnu.org/wiki/DontUseInlineAsm
The only use-case I could imagine `asm` helping is if a compiler used 2x LEA instructions in a situation that's actually front-end bound, where `imul $constant, %[src], %[dst]` would let it copy-and-multiply with 1 uop instead of 2. But your asm removes the possibility of using immediates (you only allowed register constraints), and GNU C inline asm can't let you use a different template for immediate vs. register args. Maybe if you used multi-alternative constraints and a matching register constraint for the register-only part? But no, you'd still have to have something like `asm("imul %2, %1, %0" : ...)`, and that can't work for reg,reg: the 3-operand form of `imul` only exists with an immediate multiplier.
You could use `if(__builtin_constant_p(a)) { asm using imul-immediate } else { return a*b; }`, which would work with GCC to let you defeat LEA. Or just require a constant multiplier anyway, since you'd only ever want to use this for a specific gcc version to work around a specific missed optimization. (I.e. it's so niche that in practice you wouldn't ever do this.)
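A sketch of that `__builtin_constant_p` idea (here testing `b`, the multiplier; it needs optimization enabled so the dead branch is removed, because the `"i"` constraint only matches compile-time constants):

```c
// Sketch: use imul-immediate when the multiplier is a compile-time
// constant, otherwise fall back to plain C and let gcc do its thing.
static inline int multiply_imm(int a, int b) {
    if (__builtin_constant_p(b)) {
        int dst;
        asm("imull %2, %1, %0"          // imul $imm, src, dst: copy-and-multiply
            : "=r"(dst)
            : "r"(a), "i"(b));          // "i": immediate-only constraint
        return dst;
    }
    return a * b;                       // runtime b: the compiler chooses
}
```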
Your code on the Godbolt compiler explorer, with clang7.0 `-O3` for the x86-64 System V calling convention:
```asm
# clang7.0 -O3 (The functions both inline and optimize away)
main:                                # @main
    push    rbx
    sub     rsp, 16
    call    clock
    mov     rbx, rax                 # save the return value
    call    clock
    sub     rax, rbx                 # end - start time
    cvtsi2sd xmm0, rax
    divsd   xmm0, qword ptr [rip + .LCPI2_0]
    movsd   qword ptr [rsp + 8], xmm0    # 8-byte Spill
    call    clock
    mov     rbx, rax
    call    clock
    sub     rax, rbx                 # same block again for the 2nd group.
    xorps   xmm0, xmm0
    cvtsi2sd xmm0, rax
    divsd   xmm0, qword ptr [rip + .LCPI2_0]
    movsd   qword ptr [rsp], xmm0    # 8-byte Spill
    mov     edi, offset .L.str
    mov     al, 1
    movsd   xmm0, qword ptr [rsp + 8]    # 8-byte Reload
    call    printf
    mov     edi, offset .L.str.1
    mov     al, 1
    movsd   xmm0, qword ptr [rsp]    # 8-byte Reload
    call    printf
    xor     eax, eax
    add     rsp, 16
    pop     rbx
    ret
```
TL:DR: if you want to understand inline asm performance on this fine-grained level of detail, you need to understand how compilers optimize in the first place.