Your version destroys the caller's ebx value. That register is call-preserved in the i386 System V ABI (which Linux uses). Pick a different tmp reg like EDX. EAX, ECX, and EDX are call-clobbered in all the standard 32-bit calling conventions, so those are the registers you can use for scratch space without saving/restoring.
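If you really did need another register, the way to use a call-preserved one like %ebx is to save/restore the caller's value around your use of it. A minimal sketch (hypothetical function, only to show the push/pop convention):

square_via_ebx:                     # unsigned square_via_ebx(unsigned x)
        pushl   %ebx                # %ebx is call-preserved: save the caller's value
        movl    8(%esp), %ebx       # the arg is at 8(%esp) now, because of the push
        movl    %ebx, %eax
        imull   %ebx, %eax          # x*x, using %ebx as scratch purely for illustration
        popl    %ebx                # restore before returning
        ret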
Then the computation is done using %eax (I suppose this is for performance reasons?) No, it's because imul doesn't have a memory-destination encoding, only r/m source.

If you cared about performance, you wouldn't introduce store/reload latency into the dependency chain for the product you're computing. Even keeping the loop counter in memory would be better (for large counts at least) because dec latency is shorter than imul latency.

In your loop, one input for each iteration comes from the previous iteration. The total latency of store/load/imul / store/load/imul / ... is what bottlenecks your loop on an out-of-order execution CPU (all current x86 CPUs are like that). Store-forwarding makes store/reload only about 5 cycles, but that's still longer than imul latency on most CPUs.
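To put a rough number on that, here's a hedged reconstruction of the store/reload pattern (not your exact code; the offsets and labels are made up for illustration):

find_pow_mem:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp            # the -4(%ebp) slot for the running product
        movl    8(%ebp), %eax       # x
        movl    %eax, -4(%ebp)      # tot = x, kept in memory
        movl    12(%ebp), %ecx      # n
        testl   %ecx, %ecx
        je      done_mem
loop_mem:
        movl    -4(%ebp), %eax      # reload tot:   ~5 cycles (store-forwarding)
        imull   8(%ebp), %eax       # tot *= x:     ~3 cycles
        movl    %eax, -4(%ebp)      # store tot back
        decl    %ecx
        jnz     loop_mem            # ~8 cycles of loop-carried latency per iteration
done_mem:
        movl    -4(%ebp), %eax      # return value
        leave
        ret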
If you're running low on registers and want to keep one of your variables in memory, generally prefer spilling a variable that is only read inside a loop, not written. Re-reading the same memory location multiple times is fine.

In compiler/optimization terminology, "spilling" a variable from a register is when you can't do the preferred thing of keeping it in a register, and instead have to store it to memory.
e.g. imul 8(%ebp), %eax as your loop body (see the sketch after this list):

- you don't need a tmp reg to load+operate on a value from memory; you only need a counter + accumulator.
- the loop-carried dependency chain is only imul -> imul -> imul ... for the critical path.
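A minimal sketch of that, assuming the same argument layout as your version (x at 8(%ebp), n at 12(%ebp)) and the tot = x starting value of the C version further down:

find_pow:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax       # tot = x   (accumulator stays in a register)
        movl    12(%ebp), %ecx      # n         (counter stays in a register too)
        testl   %ecx, %ecx
        je      done                # n == 0: just return x
loop_find_pow:                      # do {
        imull   8(%ebp), %eax       #   tot *= x   (memory-source imul, no tmp reg)
        decl    %ecx
        jnz     loop_find_pow       # } while(--n)
done:
        popl    %ebp
        ret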
You also didn't need to copy to -4(%ebp); you could have just modified 8(%ebp), where you already had this variable in memory. According to the calling convention, you "own" your stack args and can modify that copy of your variables. So even if you did want to modify a value in memory, you still didn't need to copy it.
Your loop structure is also inefficient, although the 3-cycle latency of imul (or worse on some older CPUs) will easily hide the overhead of running all those extra instructions vs. a simple dec %ecx / jnz loop_find_pow at the bottom of your loop.
Why are loops always compiled into "do...while" style (tail jump)?
Look at optimized compiler output for a C version of your function!
How to remove "noise" from GCC/clang assembly output?
(update: this could be simplified further because we're guaranteed that n>0, so we could just directly use a do{}while(--n); loop structure. I missed that in the question originally and will leave that simplification as an exercise for the reader.)
unsigned find_pow(unsigned x, unsigned n) {
    unsigned tot = x;        // or should this be 1?  Possible off-by-1 error
    for(; n ; n--) {         // gcc fails to convert i=0 ; i<n ; i++ into a down-count
        tot *= x;
    }
    return tot;
}
GCC 9.1 -O3 -m32 -fno-tree-vectorize -march=skylake on Godbolt:
find_pow:
        movl    4(%esp), %ecx
        movl    8(%esp), %eax
        movl    %ecx, %edx
        testl   %eax, %eax          # if(n==0) return
        je      .L1
.L3:                                # do{
        imull   %ecx, %edx
        decl    %eax
        jne     .L3                 # }while(--n);
.L1:
        movl    %edx, %eax          # silly compiler, should have used tot=EAX in the first place
        ret
clang unrolls (unless you use -fno-unroll-loops). But it doesn't use multiple accumulators for the product to hide imul latency /facepalm. imul is pipelined with better throughput than latency on modern x86 CPUs, typically 3-cycle latency / 1-cycle throughput. See https://agner.org/optimize/.
But clang just repeats imul %ecx, %eax 8 times in its main loop, which is basically useless. Multiplying into %edx and %eax in parallel, and then multiplying those together at the end, would let two chains of imul run in parallel. Use at least 3 registers for best performance with large n. But practical values of n are relatively small (even 2**32 will overflow a 32-bit integer), so large-n optimizations are only interesting if you want to use this to get the low bits of (x**n) % (2**32).
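For example, something along these lines (a sketch of the two-accumulator idea, not what clang actually emits; label names are mine, and x is re-read from its stack slot since it's only read, per the spilling advice above):

find_pow_2acc:                      # same x**(n+1) semantics as the C version above
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax       # acc0 = x
        movl    $1, %edx            # acc1 = 1
        movl    12(%ebp), %ecx      # n (counter); x stays in memory, it's read-only
        cmpl    $2, %ecx
        jb      odd_tail            # fewer than 2 multiplies left
pair_loop:                          # do {
        imull   8(%ebp), %eax       #   acc0 *= x   (independent chain 1)
        imull   8(%ebp), %edx       #   acc1 *= x   (independent chain 2)
        subl    $2, %ecx            #   n -= 2
        cmpl    $2, %ecx
        jae     pair_loop           # } while (n >= 2)
odd_tail:
        testl   %ecx, %ecx
        je      combine
        imull   8(%ebp), %eax       # one leftover multiply if n was odd
combine:
        imull   %edx, %eax          # acc0 * acc1: the two chains only meet here
        popl    %ebp
        ret

For really large n you'd want a third chain (e.g. push/pop %ebx for another accumulator), as mentioned above.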
I think that's a safe optimization even with wraparound at 2^32: GCC auto-vectorizes with AVX2 vpmulld if available (-march=skylake), which wouldn't be safe if integer multiply wasn't associative.
Clang does unroll with multiple accumulators when optimizing FP loops with -ffast-math (to make it treat FP add/mul as associative), so it's too bad it isn't doing that with integer math. Unrolling by 8 only makes sense if the compiler doesn't realize that n is probably usually small; if clang is going to optimize for large n, it should do it right and hide imul latency.
When auto-vectorizing with AVX2, clang does get this somewhat right and uses 4 vectors of totals in parallel. But vpmulld has 10-cycle latency on Haswell/Skylake and clang unrolls by 32, not just 4. Seems like way too much.