59

Sometimes compilers generate code with weird instruction duplications that can safely be removed. Consider the following piece of code:

int gcd(unsigned x, unsigned y) {
  return x == 0 ? y : gcd(y % x, x);
}

Here is the assembly code (generated by clang 5.0 with optimizations enabled):

gcd(unsigned int, unsigned int): # @gcd(unsigned int, unsigned int)
  mov eax, esi
  mov edx, edi
  test edx, edx
  je .LBB0_1
.LBB0_2: # =>This Inner Loop Header: Depth=1
  mov ecx, edx
  xor edx, edx
  div ecx
  test edx, edx
  mov eax, ecx
  jne .LBB0_2
  mov eax, ecx
  ret
.LBB0_1:
  ret

In the following snippet:

  mov eax, ecx
  jne .LBB0_2
  mov eax, ecx

If the jump doesn't happen, eax is reassigned for no obvious reason.

The other example is two ret's at the end of the function: one would perfectly work as well.

Is the compiler simply not intelligent enough or there's a reason to not remove the duplications?

Ignat Loskutov
  • 781
  • 4
  • 9

2 Answers2

42

Compilers can perform optimisations that are not obvious to people and removing instructions does not always make things faster.

A small amount of searching shows that various AMD processors have branch prediction problems when a RET is immediately after a conditional branch. By filling that slot with what is essentially a no-op, the performance problem is avoided.

Update:

Example reference, section 6.2 of the "Software Optimization Guide for AMD64 Processors" (see http://support.amd.com/TechDocs/25112.PDF) says:

Specifically, avoid the following two situations:

  • Any kind of branch (either conditional or unconditional) that has the single-byte near-return RET instruction as its target. See “Examples.”

  • A conditional branch that occurs in the code directly before the single-byte near-return RET instruction.

It also goes into detail on why jump targets should have alignment which is also likely to explain the duplicate RETs at the end of the function.

janm
  • 17,976
  • 1
  • 43
  • 61
  • @Cornstalks [this](http://mikedimmick.blogspot.com.br/2008/03/what-heck-does-ret-mean.html) my be one. – Mário Feroldi Dec 29 '17 at 04:51
  • @Cornstalks Updated with details. – janm Dec 29 '17 at 05:01
  • 9
    Do you have evidence that the compilers are indeed doing it on purpose for this reason? If they want a NOP before the RET, this MOV is far from the simplest they could have gone for. If this only affects AMD cpus, -mtune=intel or similar should remove it, but it doesn't. – Marc Glisse Dec 29 '17 at 09:07
  • @MarcGlisse Is your question whether I have evidence that optimisation tools are optimising on purpose rather than by accident? No. It could just be a bug that makes code go faster. Do you have evidence that the generated code is slower on Intel CPUs without the instruction removed? – janm Dec 29 '17 at 13:16
  • 2
    I have evidence that the code is longer and still generated at -Os --> bug. – Marc Glisse Dec 29 '17 at 14:38
  • 2
    Is there no *real* NO-OP instruction? Why use an instruction that's only "essentially" a NO-OP? – Barmar Jan 02 '18 at 19:20
  • Looking at the assembly, my first instinct was that this `mov` following the `jne` is some sort of a [**delay slot**](https://en.m.wikipedia.org/wiki/Delay_slot). That would explain why the compiler generates it even for non-AMD targets. – TheCodeArtist Jan 03 '18 at 13:20
  • @Barmar not really really real, there's just one "essentially a noop" instruction that happens to have the mnemonic `NOP` (it's `XCHG AX, AX`) – hobbs Jan 04 '18 at 01:44
  • 1
    @TheCodeArtist Yes, I agree -- I thought about that, then found the AMD harder requirement. The real point is that scheduling on modern CPUs is not just about short code. There is all kinds of pipeline maintenance required. – janm Jan 04 '18 at 01:52
5

Any compiler will have a bunch of transformations for register renaming, unrolling, hoisting, and so on. Combining their outputs can lead to suboptimal cases such as what you have shown. Marc Glisse offers good advice: it's worth a bug report. You are describing an opportunity for a peephole optimizer to discard instructions that either

  • don't affect the state of registers & memory at all, or
  • don't affect state that matters for a function's post-conditions, won't matter for its public API.

Sounds like an opportunity for symbolic execution techniques. If the constraint solver finds no branch points for a given MOV, perhaps it is really a NOP.

J_H
  • 17,926
  • 4
  • 24
  • 44