
I checked the uops.info instruction table (https://uops.info/table.html) and found that the listed throughput (TP) for `jmp rel8` is far higher than for `jmp rel32`. Does this mean that `jmp rel8` is slower than `jmp rel32`?

jmp rel32

With unroll_count=500 and no inner loop

    Code:

       0:   e9 00 00 00 00          jmp    0x5

    Results:
        Instructions retired: 1.0
        Core cycles: 2.75
        Reference cycles: 2.05

jmp rel8

With unroll_count=500 and no inner loop

    Code:

       0:   eb 00                   jmp    0x2

    Results:
        Instructions retired: 1.0
        Core cycles: 5.84
        Reference cycles: 4.61
edited by Sep Roland
asked by HelloGUI

1 Answer


That's not a very representative measurement. One taken branch per 2 cycles is normal throughput, or 1/clock for loop branches in tiny loops. But branch prediction can do worse with more branches per 16-byte block of code, depending on the microarchitecture, so packing many `jmp next_instruction` instructions (`jmp` with rel8=0) back to back is bad. (Especially when you put 500 of them in a row, like in Slow jmp-instruction)
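For concreteness, here's a hypothetical NASM sketch of what an unrolled test body of that shape looks like (the repeat count matches the question's unroll_count=500; this is an illustration, not the actual uops.info harness):

```nasm
; Sketch (assumption: not the real benchmark harness) of an
; unroll_count=500 body: 500 taken jump-to-next instructions in a row.
%rep 500
    jmp short $+2       ; EB 00: always-taken jump to the next instruction
%endrep
```

Each `jmp short $+2` is 2 bytes, so this packs 8 taken branches into every 16-byte block, which stresses the front-end and branch predictors far harder than real code does.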

That 5.84 number looks like Alder Lake P-cores. They came up with different numbers for other uarches; it matters a lot which architecture you look at for something this low-level.

Anyway, I think the key point here is that https://uops.info/ doesn't benchmark taken jumps very well; they use the same test harness as for other instructions (unrolling the instruction many times), which for taken branches produces numbers that don't characterize typical use.

Agner Fog's instruction tables report different numbers (https://agner.org/optimize/), e.g. 1-2 cycle throughput for relative jmp on Skylake and Ice Lake, same as most earlier Intel. That's realistic if you have jumps inside a loop, so it's the same few jump instructions that execute in sequence.
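By contrast, Agner-style numbers correspond to a few jump instructions executing repeatedly inside a loop, something like this NASM sketch (the loop count and labels are invented for illustration):

```nasm
; Sketch: one taken jmp measured inside a small loop, so the same
; branch executes over and over and predicts trivially.
    mov  ecx, 1000000
.top:
    jmp  short .next    ; the taken jump being measured
.next:
    dec  ecx
    jnz  .top           ; loop branch, ~1/clock when correctly predicted
```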

But uops.info measured 2.12c or 4.80c for Skylake, way higher, something you hopefully only run into with artificial microbenchmarks.

Peter Cordes
  • About the arch, yes, but almost all arches have this problem (high for rel8 and low for rel32) except `AMD ZEN4` ... I think ZEN4 is going to be the best arch ... xD – HelloGUI May 23 '23 at 06:10
  • But I got you. You're saying the problem is about a row test, but when we put it in real code, there is no difference ... Right? – HelloGUI May 23 '23 at 06:12
  • @HelloGUI I don't know what you mean by "a row test". As for Zen 4, yes, it's a good microarchitecture, and apparently its front-end branch-prediction and decoders, and maybe uop-cache, can handle dense branches. Intel's uop cache ends a line on taken branches, so this would all have to run from legacy decode. I don't know if Zen 4 is like that, or if they perhaps special-case `jmp +0` as a no-op. – Peter Cordes May 23 '23 at 06:16
  • How about aligning, just to convert a `SHORT jmp` to a `LONG jmp`? Because, as we discussed in the last question, GCC even turns a `SHORT jmp` into a `LONG jmp`, and you said "GCC's heuristics aren't perfect for all situations" … Is it possible that GCC aligns jump targets to get the 16-byte alignment advantage, turning a `SHORT jmp` into a `LONG jmp` where possible (because maybe `rel32 jmp` works better?)? – HelloGUI May 23 '23 at 06:22
  • 1
    No, `jmp rel32` is almost always worse (a waste of code size), and compilers are never *trying* to waste space to make the assembler use a longer jump. GCC doesn't "turn short jumps into long jumps", it just doesn't try hard to avoid them when deciding whether to align something or not. It doesn't track instruction sizes in bytes, so it doesn't know exactly where it is; it leaves that to the assembler. And assemblers (including GAS) will use `jmp rel8` when possible: [Why is the "start small" algorithm for branch displacement not optimal?](https://stackoverflow.com/q/34911142) – Peter Cordes May 23 '23 at 06:45
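To illustrate the assembler's choice, a small NASM sketch (labels and the padding size are invented): the assembler emits the 2-byte `rel8` form when the target is within roughly ±127 bytes, and falls back to the 5-byte `rel32` form only when it has to.

```nasm
near_target:
    jmp near_target     ; target in rel8 range: assembles to EB xx (jmp rel8)
    jmp far_target      ; target out of rel8 range: E9 xx xx xx xx (jmp rel32)
    times 300 nop       ; 300 bytes of padding pushes far_target past +127
far_target:
```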