
So, I'm trying to get familiar with assembly and reverse-engineer some code. My problem lies in trying to decode addq, which I understand performs Source + Destination = Destination.

I am working under the assumption that the parameters x, y, and z are passed in registers %rdi, %rsi, and %rdx. The return value is stored in %rax.

long someFunc(long x, long y, long z){
1.  long temp=(x-z)*x;
2.  long temp2= (temp<<63)>>63;
3.  long temp3= (temp2 ^ x);
4.  long answer=y+temp3; 
5.  return answer;
}

So far, everything above line 4 compiles to exactly what I expect. However, line 4 gives me leaq (%rsi,%rdi), %rax rather than addq %rsi, %rax. I'm not sure if this is something I am doing wrong, but I am looking for some insight.

  • I fixed the code in my question. I ended up shortening the temps and such. – letsgetphysical Oct 21 '19 at 21:51
  • For me `gcc -O0` or `gcc -O3 -march=atom` produces an `add`. – Jester Oct 21 '19 at 21:53
  • @Jester: Interesting. But instead of doing better register allocation, it uses an extra `mov` before `add`. That might still be better than LEA on in-order Atom: Atom runs LEA on the actual AGU hardware earlier in the pipeline, so its inputs need to be ready sooner, by a couple of cycles IIRC. So in this chain of dependent operations, LEA could introduce a 2-cycle stall vs. just 1 extra cycle of `mov` latency. – Peter Cordes Oct 21 '19 at 22:46

1 Answer


Those instructions aren't equivalent. For LEA, rax is a pure output. For your hoped-for add, it's rax += rsi so the compiler would have to mov %rdi, %rax first. That's less efficient so it doesn't do that.

lea is a totally normal way for compilers to implement dst = src1 + src2, saving a mov instruction. In general, don't expect C operators to compile to the instruction named after them, especially small left-shifts and add, or multiply by 3, 5, or 9: those are prime targets for optimization with LEA. e.g. lea (%rsi, %rsi, 2), %rax implements result = y*3. See Using LEA on values that aren't addresses / pointers? for more. LEA is also useful to avoid destroying either of the inputs, if they're both needed later.
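For example, here are two tiny functions that mainstream compilers typically turn into a single lea at -O2 (the function names are mine, not from the question; exact output can vary by compiler and options, so check on Godbolt):

long sum2(long a, long b) { return a + b; }   // typically: leaq (%rdi,%rsi), %rax
long times3(long y)       { return y * 3; }   // typically: leaq (%rdi,%rdi,2), %rax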


Assuming you meant t3 to be the same variable as temp3, clang does compile this the way you were expecting: it does a better job of register allocation than GCC, so it can use a shorter and more efficient add as the last instruction instead of needing lea, without any extra mov instructions (Godbolt). That saves code size (the indexed addressing mode in the lea needs an extra SIB byte), and add has slightly better throughput than LEA on most CPUs, like 4/clock instead of 2/clock.

Clang also optimized the shifts into andl $1, %eax / negq %rax to create the 0 or -1 result of that arithmetic right shift = bit-broadcast. It also optimized to 32-bit operand-size for the first few steps because the shifts throw away all but the low bit of temp1.

# side by side comparison, like the Godbolt diff pane
clang:                           |    gcc:
        movl    %edi, %eax              movq    %rdi, %rax
        subl    %edx, %eax              subq    %rdx, %rdi
        imull   %edi, %eax              imulq   %rax, %rdi  # temp1

        andl    $1, %eax                salq    $63, %rdi
        negq    %rax                    sarq    $63, %rdi   # temp2

        xorq    %rdi, %rax              xorq    %rax, %rdi  # temp3

        addq    %rsi, %rax              leaq    (%rdi,%rsi), %rax   # answer
        retq                            ret

Notice that clang chose imul %edi, %eax (into RAX) but GCC chose to multiply into RDI. That's the difference in register allocation that leads to GCC needing an lea at the end instead of an add.

Compilers sometimes even get stuck with an extra mov instruction at the end of a small function when they make poor choices like this, if the last operation wasn't something like addition that can be done with lea as a non-destructive op-and-copy. These are missed-optimization bugs; you can report them on GCC's bugzilla.


Other missed optimizations

GCC and clang could have optimized by using and instead of imul: only the low bit of the product matters, and it is set only if both inputs are odd, i.e. it's the AND of the inputs' low bits.

Also, since only the low bit of the sub output matters, XOR (add without carry) would have worked, or even addition! (odd ± even = odd, even ± even = even, odd ± odd = even.) That would have allowed an lea instead of mov/sub as the first instruction.

        lea    (%rdi,%rdx), %eax    # x+z: same low bit as x-z
        and    %edi, %eax           # low bit matches (x-z)*x

        andl    $1, %eax            # keep only the low bit
        negq    %rax                # temp2

Let's make a truth table for the low bits of x and z to see how this shakes out if we want to optimize more / differently:

# truth table for low bit: input to shifts that broadcasts this to all bits
 x&1  |  z&1  | x-z = x^z | x*(x-z) = x & (x-z)
  0       0          0       0
  0       1          1       0
  1       0          1       1
  1       1          0       0
                            x & (~z) = BMI1 andn

So temp2 = (x^z) & x & 1 ? -1 : 0. But also temp2 = -((x & ~z) & 1). We can rearrange that to -((x&1) & ~z) which lets us start with not z and and $1, x in parallel, for better ILP. Or if z might be ready first, we could do operations on it and shorten the critical path from x -> answer, at the expense of z.
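In C terms, that rearrangement collapses the whole function to something like the sketch below (my own naming, not code from the question; it relies on the identity derived above):

long someFunc_bits(long x, long y, long z){
    long temp2 = -((x & ~z) & 1);   // -1 if x is odd and z is even, else 0
    long temp3 = temp2 ^ x;         // x unchanged, or ~x
    return y + temp3;
}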

Or with a BMI1 andn instruction, which does (~z) & x, we can do this in one instruction (plus another to isolate the low bit).

I think this version has the same behaviour for every possible input, so compilers could have emitted it from your source code. Here's one possibility you could wish your compiler had emitted:

# hand-optimized
# long someFunc(long x, long y, long z)
someFunc:
        not    %edx                   # ~z
        and    $1, %edx
        and    %edi, %edx             # x&1 & ~z = low bit of temp1
        neg    %rdx                  # temp2 = 0 or -1

        xor    %rdi, %rdx            # temp3 = x or ~x

        lea    (%rsi, %rdx), %rax    # answer = y + temp3
        ret
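If you want to convince yourself that the transformed logic really matches the question's C for every input, here's a quick brute-force test harness (my own sketch, not part of the original answer; it assumes a 64-bit long, as on x86-64 Linux). Note that the signed shifts in the original source are formally implementation-defined / UB in ISO C, but mainstream compilers implement them as the arithmetic shifts shown in the asm above:

#include <stdio.h>
#include <stdlib.h>

// Original logic from the question: the shift pair broadcasts the low bit
// of (x-z)*x to every bit position, giving 0 or -1.
static long someFunc_orig(long x, long y, long z){
    long temp  = (x - z) * x;
    long temp2 = (temp << 63) >> 63;
    long temp3 = temp2 ^ x;
    return y + temp3;
}

// Rearranged version using the x & ~z & 1 identity derived above.
static long someFunc_bits(long x, long y, long z){
    return y + (x ^ -((x & ~z) & 1));
}

// rand() only gives ~31 bits, so stitch a 64-bit test value together.
static long rand64(void){
    unsigned long hi = rand(), mid = rand(), lo = rand();
    return (long)((hi << 62) ^ (mid << 31) ^ lo);
}

int main(void){
    // Small values hit every parity combination; random values cover the rest.
    for (long x = -4; x <= 4; x++)
        for (long y = -4; y <= 4; y++)
            for (long z = -4; z <= 4; z++)
                if (someFunc_orig(x, y, z) != someFunc_bits(x, y, z)){
                    printf("mismatch at %ld %ld %ld\n", x, y, z);
                    return 1;
                }
    srand(1);
    for (int i = 0; i < 10000000; i++){
        long x = rand64(), y = rand64(), z = rand64();
        if (someFunc_orig(x, y, z) != someFunc_bits(x, y, z)){
            printf("mismatch at %ld %ld %ld\n", x, y, z);
            return 1;
        }
    }
    puts("no mismatches found");
    return 0;
}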

In that hand-optimized version there's still no ILP, unless z is ready before x and/or y. Using an extra mov instruction, we could do x&1 in parallel with not z.

Possibly you could do something with test/setz or cmov, but IDK if that would beat lea/and (temp1) + and/neg (temp2) + xor + add.

I haven't looked into optimizing the final xor and add, but note that temp3 is basically a conditional NOT of x. You could maybe improve latency at the expense of throughput by calculating both ways at once and selecting between them with cmov. Possibly by involving the 2's complement identity that -x - 1 = ~x. Maybe improve ILP / latency by doing x+y and then correcting that with something that depends on the x and z condition? Since we can't subtract using LEA, it seems best to just NOT and ADD.
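In C, the compute-both-ways-and-select idea might look like the sketch below (my naming; whether the ternary actually becomes a cmov depends on the compiler and options). The hand-written asm version with BMI1 andn follows.

long someFunc_select(long x, long y, long z){
    long plain   = y + x;     // result if the condition is false
    long flipped = y + ~x;    // result if x is odd and z is even
    return ((x & ~z) & 1) ? flipped : plain;
}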

# return y + x  or   y + (~x) according to the condition on x and z
someFunc:
        lea    (%rsi, %rdi), %rax     # y + x

        andn   %edi, %edx, %ecx       # ecx = x & (~z)

        not    %rdi                   # ~x
        add    %rsi, %rdi             # y + (~x)

        test    $1, %cl
        cmovnz  %rdi, %rax            # select between y+x and y+~x
        retq

This has more ILP, but needs BMI1 andn to still be only 6 (single-uop) instructions. Broadwell and later have single-uop CMOV; on earlier Intel it's 2 uops.

The other version could be 5 uops using BMI1 andn.

In this version, the first 3 instructions can all run in the first cycle, assuming x,y, and z are all ready. Then in the 2nd cycle, ADD and TEST can both run. In the 3rd cycle, CMOV can run, taking integer inputs from LEA, ADD, and flag input from TEST. So the total latency from x->answer, y->answer, or z->answer is 3 cycles in this version. (Assuming single-uop / single-cycle cmov). Great if it's on the critical path, not very relevant if it's part of an independent dep chain and throughput is all that matters.

vs. 5 (andn) or 6 cycles (without) for the previous attempt. Or even worse for the compiler output using imul instead of and (3 cycle latency just for that instruction).

Peter Cordes
  • So, if I was trying to get the equivalent line from Clang in gcc is there anything that can be done differently? – letsgetphysical Oct 21 '19 at 22:51
  • @letsgetphysical: not really, unless you want to dig into GCC's source code and figure out how to teach its register allocation code to find this missed-optimization. Or Jester points out that `-mtune=atom` will make it use `mov` to RAX + `add`, but I don't see how that's any different from `lea`. And it will make GCC *prefer* lea over add in other cases. If you care about this level of detail in how your C turns into asm, maybe you need to do it yourself by hand. I'm not sure what your goal is, or why you'd care. You could maybe hack GCC to never use LEA and make slower code... – Peter Cordes Oct 21 '19 at 23:03
  • I'm just trying to ensure that I'm interpreting the asm correctly. – letsgetphysical Oct 21 '19 at 23:07
  • @letsgetphysical: Compilers will frequently use LEA as a copy-and-add. Don't expect an actual `add` instruction; in other cases it's *not* the best choice. – Peter Cordes Oct 21 '19 at 23:25
  • Understood. I ended up removing some of my variable declaration statements and instead changed the values of x, y, and z and it changed to addq. – letsgetphysical Oct 21 '19 at 23:48
  • @letsgetphysical: updated my answer to say more clearly that you should not in general expect C operators to compile to the asm instruction of the same name. And added some hand-optimized versions that significantly transform the logic, partly for the sake of it as a demonstration, and also to create a lower-latency version. – Peter Cordes Oct 22 '19 at 01:05