How does the GNU toolchain decide to use near vs. short jump instructions?

Question

I've got some code that gcc (4.8.5, if it matters) compiles into nearly identical binaries on two different machines, except for one spot where something in the toolchain on one machine uses a "near" JE instruction (6 bytes, 32-bit displacement) while the toolchain on the other machine uses a "short" JE instruction (2 bytes, 8-bit displacement):

41e274:   85 ed                   test   %ebp,%ebp          41e274:   85 ed                   test   %ebp,%ebp
41e276:   0f 84 14 00 00 00       je     41e290         |   41e276:   74 18                   je     41e290
41e27c:   31 ed                   xor    %ebp,%ebp      |   41e278:   31 ed                   xor    %ebp,%ebp
41e27e:   48 81 c4 30 01 00 00    add    $0x130,%rsp    |   41e27a:   48 81 c4 30 01 00 00    add    $0x130,%rsp
41e285:   89 e8                   mov    %ebp,%eax      |   41e281:   89 e8                   mov    %ebp,%eax
41e287:   5b                      pop    %rbx           |   41e283:   5b                      pop    %rbx
41e288:   5d                      pop    %rbp           |   41e284:   5d                      pop    %rbp
41e289:   41 5c                   pop    %r12           |   41e285:   41 5c                   pop    %r12
41e28b:   c3                      retq                  |   41e287:   c3                      retq
41e28c:   0f 1f 40 00             nopl   0x0(%rax)      |   41e288:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,
                                                        >   41e28f:   00
41e290:   48 8d 7c 24 60          lea    0x60(%rsp),%rdi    41e290:   48 8d 7c 24 60          lea    0x60(%rsp),%rdi

I'm assuming this is some sort of optimization (though I'm unclear as to whether it's the compiler or assembler that is being clever).
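To make the difference between the two encodings concrete, here is a small Python sketch that decodes the `je` bytes from the listing above. The function names `je_target` and `fits_short` are made up for illustration; the only facts used are that the short form is `74` plus an 8-bit displacement, the near form is `0f 84` plus a 32-bit displacement, and both displacements are relative to the address of the following instruction.

```python
def je_target(address, encoding):
    """Return the branch target of a short (rel8) or near (rel32) JE.

    `address` is the address of the JE itself; `encoding` is the hex
    byte string as printed by objdump.
    """
    raw = bytes.fromhex(encoding)
    if raw[:1] == b"\x74":                  # short JE: 74 + 1-byte displacement
        disp = int.from_bytes(raw[1:2], "little", signed=True)
    elif raw[:2] == b"\x0f\x84":            # near JE: 0F 84 + 4-byte displacement
        disp = int.from_bytes(raw[2:6], "little", signed=True)
    else:
        raise ValueError("not a JE encoding")
    # The displacement is counted from the end of the jump instruction.
    return address + len(raw) + disp


def fits_short(address, target):
    """True if a 2-byte short JE at `address` could reach `target`."""
    disp = target - (address + 2)
    return -128 <= disp <= 127


print(hex(je_target(0x41e276, "0f 84 14 00 00 00")))  # near form
print(hex(je_target(0x41e276, "74 18")))              # short form
print(fits_short(0xc323, 0xc340))                     # the other je in this file
```

Both encodings at `41e276` decode to the same target, `41e290`, and the `je` at `c323` is also within short-jump range, so the choice between the two forms is not forced by the distance to the target.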

The odd thing, though, is that (in the same object file) there are other nearby `je` branches (to targets less than 40 or so bytes away) that neither toolchain shortens to the two-byte form:

c320:   45 85 ed                test   %r13d,%r13d          c320:   45 85 ed                test   %r13d,%r13d
c323:   0f 84 17 00 00 00       je     c340                 c323:   0f 84 17 00 00 00       je     c340
c329:   45 31 ed                xor    %r13d,%r13d          c329:   45 31 ed                xor    %r13d,%r13d
c32c:   48 81 c4 30 01 00 00    add    $0x130,%rsp          c32c:   48 81 c4 30 01 00 00    add    $0x130,%rsp
c333:   44 89 e8                mov    %r13d,%eax           c333:   44 89 e8                mov    %r13d,%eax
c336:   5b                      pop    %rbx                 c336:   5b                      pop    %rbx
c337:   5d                      pop    %rbp                 c337:   5d                      pop    %rbp
c338:   41 5c                   pop    %r12                 c338:   41 5c                   pop    %r12
c33a:   41 5d                   pop    %r13                 c33a:   41 5d                   pop    %r13
c33c:   41 5e                   pop    %r14                 c33c:   41 5e                   pop    %r14
c33e:   c3                      retq                        c33e:   c3                      retq   
c33f:   90                      nop                         c33f:   90                      nop 
c340:   4c 8d 74 24 60          lea    0x60(%rsp),%r14      c340:   4c 8d 74 24 60          lea    0x60(%rsp),%r14

Is there some parameter/configuration for gcc (or ld) that I can use to control this? I'd like to ensure that my code results in the same binary when compiled elsewhere.

Interesting that it doesn’t turn out to shorten the code any, since it just ends up adding 4 bytes to the nop. — prl, Mar 17 '20 at 04:33
Branch shortening is done by the assembler, not the compiler. What version of binutils do you have? — prl, Mar 17 '20 at 04:35
The first version is a non-PIE executable. The 2nd is either an unlinked executable or a PIE executable. You're showing identical code in both columns? Also, `4c 8d 74 24 60` is an `lea` into `%r14`, not `%r1` (which is not a standard register name in the first place). I think you cut off the last character of a line you pasted into something. (editing to fix that.) — Peter Cordes, Mar 17 '20 at 05:07
Normally gas is pretty good at optimizing jmp lengths. The common algorithm usually does very well, even though it can miss optimizations around `.p2align` blocks: [Why is the "start small" algorithm for branch displacement not optimal?](https://stackoverflow.com/q/34911142) — Peter Cordes, Mar 17 '20 at 05:42
@prl I think you've hit on it -- I've been focused on the compiler version (assuming the assembler was bundled). The binutils versions differ: binutils-2.27-41.base.el7_7.`1`.x86_64 vs. binutils-2.27-41.base.el7_7.`2`.x86_64 (CentOS 7). — jhfrontz, Mar 17 '20 at 15:08
@PeterCordes in both cases, the object files are the output from `libtool`; thanks for the fix (I must've missed the last character when I did the select prior to copying). — jhfrontz, Mar 17 '20 at 15:16
Comment from the `changelog` of [the .1 -> .2 commit](https://git.centos.org/rpms/binutils/c/73f23b2765390f847945f8c54ce48437ca15346a?branch=c7): `Implement assembler workaround for Intel JCC microcode bug.` — jhfrontz, Mar 17 '20 at 15:20
Ah, yeah that probably explains it. But the short version was fine, I think; the macro-fused test/je didn't sit on or span a 32-byte uop-cache-line boundary. (https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf). And it's not re-aligning any later jcc instructions in the same block, or even the `ret`. Making it longer was unnecessary. Looks like a missed-optimization bug introduced by that change, if that's what did it. — Peter Cordes, Mar 17 '20 at 15:32
@prl if you want to write-up an answer pointing out my flawed assumption, I'll accept it (if that's the right thing to do?) — jhfrontz, Mar 17 '20 at 16:04
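The 32-byte boundary argument in the comments can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `touches_boundary` is hypothetical, it treats the macro-fused `test`/`je` pair as one unit (as the comment does), and it assumes the addresses in the listing have the same offset modulo 32 that the code has at run time.

```python
BOUNDARY = 32  # boundary size discussed in the comments


def touches_boundary(start, length, boundary=BOUNDARY):
    """True if the bytes [start, start+length) cross a boundary
    or if the last byte ends exactly on one."""
    end = start + length                      # address of the next instruction
    crosses = start // boundary != (end - 1) // boundary
    ends_on = end % boundary == 0
    return crosses or ends_on


# test (2 bytes) + short je (2 bytes), starting at 41e274
print(touches_boundary(0x41e274, 2 + 2))
# test (2 bytes) + near je (6 bytes), starting at 41e274
print(touches_boundary(0x41e274, 2 + 6))
```

Under those assumptions both calls print `False`: the pair starts at offset 20 within its 32-byte block and ends at offset 24 (short) or 28 (near), so neither encoding touches a boundary, which matches the observation that lengthening the jump was not needed here.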
