Understanding optimized assembly code generated by gcc

Question

I'm trying to understand what kind of optimizations gcc performs when the -O3 flag is set. I'm quite confused by these two lines:

xor %esi, %esi
lea 0x0(%esi), %esi

The lea seems redundant to me. What's the point of using the lea instruction here?

Are you looking at disassembly output of an unlinked .o file, or raw `-S` output? If the latter, I don't know. If the former, that's probably where the offset goes after the linker computes it. — torek, Sep 30 '13 at 02:41
Ah, in that case, perhaps it's because the linker calculated the offset at link time, and it turned out the offset was 0. I'd have to look at the source and/or intermediate `-S` output to be sure. — torek, Sep 30 '13 at 02:45
@torek The assembly output from `-S` has not been passed to a linker. @REALFREE Is this followed by a loop? My guess is that it is just a multi-byte no-op for alignment purposes. — ughoavgfhw, Sep 30 '13 at 03:18
@ughoavgfhw: exactly -- so if it's originally something like `lea foo(%esi)`, we could immediately tell that it's trying to add a symbol value, and later the linker filled in `0` as the value of `foo`. Or, if I misinterpreted REALFREE's reply, and it's disassembly of an as-yet un-linked .o file, the `0` is because the linker has not yet filled in `foo` at all. — torek, Sep 30 '13 at 03:47
@torek I think you misunderstood me. The output of `-S` is unlinked, so the linker could not have replaced foo with an address. It has also not been disassembled, so there cannot be a relocation for it. That line simply does nothing. — ughoavgfhw, Sep 30 '13 at 04:23
@ughoavgfhw: but according to REALFREE that's the output of a *disassembler*, not the assembly *input*. Here's an `objdump --disassemble x.o` line: `6: 03 05 00 00 00 00 add 0x0,%eax`. The 0 is going to be replaced by something later. — torek, Sep 30 '13 at 05:17
@torek Oh, my bad. I misread his comment as "it is assembly output of compiler". So it could be just padding, or a relocation. You should add an answer. — ughoavgfhw, Sep 30 '13 at 17:50
@ughoavgfhw: but, it *could* be a no-op, as you say (makes sense if it's the top of a loop). Context does help :-) — torek, Oct 01 '13 at 01:10

score 4 · Accepted Answer · answered Sep 30 '13 at 04:36

That instruction is used to fill space for alignment purposes. Loops can be faster when they start on aligned addresses, because the processor loads memory into the decoder in chunks. By aligning the beginnings of loops and functions, it becomes more likely that they will be at the beginning of one of these chunks. This prevents previous instructions which will not be used from being loaded, maximizes the number of future instructions that will, and, possibly most importantly, ensures that the first instruction is entirely in the first chunk, so it does not take two loads to execute it.

The compiler knows that it is best to align the loop, and has two options to do so. It can either place a jump to the beginning of the loop, or fill the gap with no-ops and let the processor flow through them. Jump instructions break the flow of instructions and often cause wasted cycles on modern processors, so adding them unnecessarily is inadvisable. For a short distance like this no-ops are better.
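As a rough illustration of the gap being filled, here is a minimal Python sketch that computes how many filler bytes are needed to reach the next aligned address. The 16-byte boundary and the address 0x1d are assumptions chosen for the example; the actual alignment gcc requests depends on the target and its tuning options.

```python
def padding_to_align(address: int, alignment: int = 16) -> int:
    """Return how many filler bytes are needed to reach the next aligned address."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Unary minus binds tighter than %, so this is (-address) % alignment.
    return -address % alignment


# Hypothetical example: the preceding code ends at 0x1d and the loop
# should start on a 16-byte boundary (both values are assumptions).
gap = padding_to_align(0x1D)
print(f"pad {gap} bytes, loop starts at {0x1D + gap:#x}")
```

With these made-up numbers the gap is 3 bytes, which is small enough that a single filler instruction can cover it.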

The x86 architecture contains an instruction specifically for the purpose of doing nothing, nop. However, it is only one byte long, so it would take several of them to align the loop. Decoding each one and deciding it does nothing takes time, so it is faster to insert a single longer instruction that has no side effects. That is why the lea instruction you see was inserted. It writes %esi back into itself without touching the flags, so it has no architectural effect, and it is chosen to have exactly the length required. Recent processors also define dedicated multi-byte no-op instructions (nopl, opcode 0F 1F) that the processor recognizes as no-ops. The lea form is the older idiom and is still handled as an ordinary instruction.
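To see how one lea can be stretched to "the exact length required", the sketch below hand-encodes `lea 0x0(%reg), %reg` in 32-bit mode with either an 8-bit or a 32-bit zero displacement. The function name and the register table are hypothetical helpers written for this illustration. This is not how gcc or the assembler produces the bytes, only a demonstration of the encoding.

```python
# Register numbers used in the ModRM byte (32-bit mode).
REG_NUMBER = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "ebp": 5, "esi": 6, "edi": 7}


def lea_self_padding(reg: str = "esi", wide_displacement: bool = False) -> bytes:
    """Encode `lea 0x0(%reg), %reg` with an 8-bit or 32-bit zero displacement."""
    n = REG_NUMBER[reg]  # esp is omitted: it would need an extra SIB byte
    mod = 0b10 if wide_displacement else 0b01  # 10 = disp32, 01 = disp8
    modrm = (mod << 6) | (n << 3) | n  # same register as destination and base
    displacement = bytes(4 if wide_displacement else 1)  # all-zero displacement
    return bytes([0x8D, modrm]) + displacement  # 0x8D is the LEA opcode


print(lea_self_padding("esi").hex(" "))        # 3-byte form
print(lea_self_padding("esi", True).hex(" "))  # 6-byte form
```

The two forms do the same thing and differ only in how many zero displacement bytes they carry, which is what makes the instruction useful as adjustable filler.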

score 1 · Answer 2 · edited May 23 '17 at 11:57

As ughoavgfhw explained, this is padding for better code alignment. You can find this lea form in the following post:

http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-September/003881.html

quote:

  1-byte: XCHG EAX, EAX
  2-byte: 66 NOP
  3-byte: LEA REG, 0 (REG) (8-bit displacement)
  4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
  5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
**6-byte: LEA REG, 0 (REG) (32-bit displacement)**
  7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
  8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
  9-byte: NOP WORD  PTR [EAX + EAX*1 + 0] (32-bit displacement)
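For gaps longer than the longest entry in the list quoted above, several fillers have to be chained. The sketch below shows one plausible way to split a gap, assuming a simple greedy strategy. That strategy is an assumption made for illustration, and a real assembler or JIT may choose a different split.

```python
# Longest single filler instruction in the list quoted above.
MAX_FILLER = 9


def split_padding(gap: int) -> list[int]:
    """Split a gap into filler lengths, using as few instructions as possible."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    lengths = []
    while gap > 0:
        chunk = min(gap, MAX_FILLER)  # take the longest filler that still fits
        lengths.append(chunk)
        gap -= chunk
    return lengths


print(split_padding(6))   # a single 6-byte filler is enough
print(split_padding(11))  # needs two fillers
```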

Also see this SO question, which describes it in more detail: What does NOPL do in x86 system?

Note that the xor itself is not a nop (it changes the value of the register), but it is also very cheap to perform since it's a zero idiom: What is the purpose of XORing a register with itself?
