This is craptastic un-optimized code because you compiled with -O0 (compile fast, skip most optimization passes). The legacy stack-frame setup / cleanup is just noise. The arg is on the stack right above the return address, i.e. at 4(%esp) on function entry. (See also How to remove "noise" from GCC/clang assembly output?)
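For reference, the source was presumably something like this (an assumption on my part; the question's C isn't shown here, but the asm loads the first stack arg and returns it times 34):

int func(int a) {
    return a * 34;   /* the asm versions below compute this as a*32 + a*2, or (a*16 + a)*2 */
}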
It's surprising to see a compiler use 3 instructions to multiply by shifting and adding, instead of an imull $34, 4(%esp), %eax / ret, unless it's tuning for old CPUs. 2 instructions is the cutoff for modern gcc and clang with their default tuning. See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?
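The trick in that linked question boils down to a decomposition like 37 = 4*9 + 1, where each step fits one LEA addressing mode. A rough C sketch (my own illustration; the register choices in the comments are hypothetical):

int mul37(int x) {
    int t = x + x * 8;   /* one leal (%eax,%eax,8), %edx : x*9 */
    return x + t * 4;    /* one leal (%eax,%edx,4), %eax : x + 36*x = 37*x */
}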
But this can be done with 2 instructions using LEA (not counting a mov to copy a register); the code is bloated because you compiled without optimization. (Or you tuned for an old CPU where there's maybe some reason to avoid LEA.)
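The decomposition here is a*34 = a*32 + a*2 = (a<<5) + a*2: one shift plus one LEA of the reg + reg*2 form, which is exactly what the optimized gcc5.5 / clang output further down does.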
I think you must have used gcc for this; disabling optimization with other compilers always just uses imul to multiply by a non-power-of-2. But I can't find a gcc version + options on the Godbolt compiler explorer that gives exactly your code (I didn't try every possible combination). MSVC 19.10 -O2 uses the same algorithm as your code, including loading a twice.
Compiling with gcc5.5 (the newest gcc that doesn't just use imul, even at -O0), we get something like your code, but not exactly: the same operations in a different order, and without loading a from memory twice.
# gcc5.5 -m32 -xc -O0 -fverbose-asm -Wall
func:
pushl %ebp #
movl %esp, %ebp #, # make a stack frame
movl 8(%ebp), %eax # a, tmp89 # load a from the stack, first arg is at EBP+8
addl %eax, %eax # tmp91 # a*2
movl %eax, %edx # tmp90, tmp92
sall $4, %edx #, tmp92 # a*2 << 4 = a*32
addl %edx, %eax # tmp92, D.1807 # a*2 + a*32
popl %ebp # # clean up the stack frame
ret
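(Checking the arithmetic: a*2 + ((a*2)<<4) = 2a + 32a = 34a.)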
Compiling with optimization with the same older GCC version on the Godbolt compiler explorer (gcc5.5 -m32 -O3 -fverbose-asm), we get:
# gcc5.5 -m32 -O3. Also clang7.0 -m32 -O3 emits the same code
func:
movl 4(%esp), %eax # a, a # load a from the stack
movl %eax, %edx # a, tmp93 # copy it to edx
sall $5, %edx #, tmp93 # edx = a<<5 = a*32
leal (%edx,%eax,2), %eax # eax = edx + eax*2 = a*32 + a*2 = a*34
ret # with a*34 in EAX, the return-value reg in this calling convention
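Here leal (%edx,%eax,2), %eax does the add and the *2 in one instruction: the addressing-mode math computes edx + eax*2 (without touching FLAGS), so the whole multiply is just one shift plus one LEA after the load and the register copy.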
With gcc 6.x or newer, we get this efficient asm. imul-immediate with a memory source decodes to only a single micro-fused uop on modern Intel CPUs, and integer multiply has only 3-cycle latency on Intel since Core 2 and on AMD since Ryzen (https://agner.org/optimize/).
# gcc6/7/8 -m32 -O3 default tuning
func:
imull $34, 4(%esp), %eax #, a, tmp89
ret
But with -mtune=pentium3, we strangely don't get an LEA. This looks like a missed optimization; LEA has 1-cycle latency on Pentium 3 / Pentium-M.
# gcc8.2 -O3 -mtune=pentium3 -m32 -xc -fverbose-asm -Wall
func:
movl 4(%esp), %edx # a, a
movl %edx, %eax # a, tmp91
sall $4, %eax #, tmp91 # a*16
addl %edx, %eax # a, tmp92 # a*16 + a = a*17
addl %eax, %eax # tmp93 # a*16 * 2 = a*34
ret
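(The mov / sall $5 / leal sequence from the -O3 output above would do the same job in 3 instructions instead of 4, which is why this looks like a missed optimization even when tuning for Pentium 3.)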
This is the same as your code, but uses a reg-reg mov instead of reloading a from the stack to add it to the shift result.