x86_64 - Assembly - loop conditions and out of order

Question

I am not asking for a benchmark.

(If that was the case, I would have done it myself.)

My question:

I tend to avoid the indirect/index addressing modes for convenience.

As a replacement, I often use immediate, absolute or register addressing.

The code:

# %esi holds the array address. Say we iterate over a doubleword (4-byte) array.
# %ecx holds the array element count.
myloop:                     # say this label lands at address 0x98767
    ...                     # do whatever with %esi
    add $4, %esi
    dec %ecx
    jnz myloop

Here, we have a serialized combo (dec and jnz) which prevents proper out-of-order execution (dependency).

Is there a way to avoid that / break the dep? (I am not an assembly expert).

So let me get this straight: you want a conditional jump, which depends on the outcome of the previous instruction, to be executable out-of-order with that instruction? I think this is logically impossible. — davmac, Aug 02 '15 at 11:41
Also note `dec` is not recommended because it causes partial flags update stall. — Jester, Aug 02 '15 at 11:53
@davmac: my goal is to not depend on the previous instruction — Kroma, Aug 02 '15 at 12:06
@Kroma do you mean you want to re-order the `dec` and the `add`? In that case can you not use `jcxz`? (You can't make a conditional jump not-dependent on the instruction which produces the condition). — davmac, Aug 02 '15 at 12:13
You can use `lea 4(%esi),%esi` for the addition and that doesn't affect flags, so you can insert a `subl $1, %ecx` higher up. As @davmac says, you can't get rid of the dependency unless you use the `loop` instruction which is again not recommended. — Jester, Aug 02 '15 at 12:17
@davmac: yes I don't feel obliged to use the conditional jumps if there is a better solution — Kroma, Aug 02 '15 at 12:20
Also be sure to unroll the loop if possible, to amortize the cost of the loop overhead. — Jester, Aug 02 '15 at 12:53
@Jester: absolutely, but the length is variable. Nice tip though. Must take care of the cache line length. — Kroma, Aug 02 '15 at 13:22
@davmac: I wouldn't recommend `jcxz`, unless that lets you avoid a `test` or `cmp` instruction. On Intel CPUs, it's a 2-uop instruction. (Less of a big deal when code is in the uop cache, otherwise it can slow down decoding because it can only be handled by the complex decoder.) — Peter Cordes, Aug 03 '15 at 01:57
@jester: `dec` is fine when it macro-fuses with the following branch (on Intel CPUs.) AMD CPUs also avoid partial-flag stalls by treating separate bits of the flags as independent. (I haven't benchmarked AMD, or the non-macro-fused case on Intel, though.) — Peter Cordes, Aug 03 '15 at 02:06

Peter Cordes · Accepted Answer · 2020-07-23T06:07:45.423

When optimizing for Intel CPUs, always put the flag-setting instruction right before the conditional jump instruction (if it's one of the simple ones listed in the table below), so they can macro-fuse into one uop in the decoders.
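A minimal sketch of that placement, adapted from the loop in the question. The function name `sum_u32`, the x86-64 System V argument registers (pointer in `%rdi`, count in `%rsi`) and the loop body (a 32-bit add from memory) are assumptions made for illustration, not part of the original question.

```asm
    .text
    .globl  sum_u32             # hypothetical name, for illustration only
# uint32_t sum_u32(const uint32_t *p, uint64_t n)    (assumed System V ABI)
sum_u32:
    xorl    %eax, %eax          # total = 0
    testq   %rsi, %rsi          # n == 0? (test + jz is also a fusible pair)
    jz      2f
1:
    addl    (%rdi), %eax        # placeholder loop body: total += *p
    addq    $4, %rdi            # advance the pointer; its flags are overwritten below
    decq    %rsi                # flag-setting instruction ...
    jnz     1b                  # ... immediately followed by the branch that reads ZF
2:
    ret
```

Nothing sits between `decq` and `jnz`, so on CPUs that support this pairing the two can decode as a single uop.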

Doing this is not significantly worse for older CPUs that don't do macro-fusion. Putting the flag-setting instruction earlier might shorten the branch mispredict penalty by about a cycle on such CPUs, but with out-of-order execution, moving the dec a couple of instructions earlier won't make a real difference. See also "Avoid stalling pipeline by calculating conditional early". To really make a difference, you would unroll the loop and/or branch on something that can be calculated more simply, ideally without a dependency on a slow input, so that OoO exec can have the branch already resolved while it is still working on older iterations of the loop body; i.e. the loop-counter dep chain can run ahead of the main work.
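As a rough sketch of the unrolling idea, the loop below handles four elements per `dec`/`jnz` pair and finishes the remainder in a simple cleanup loop, so a variable length is still supported. The function name `sum_u32_unrolled`, the System V argument registers and the summing loop body are assumptions for illustration only.

```asm
    .text
    .globl  sum_u32_unrolled    # hypothetical name, for illustration only
# uint32_t sum_u32_unrolled(const uint32_t *p, uint64_t n)   (assumed System V ABI)
sum_u32_unrolled:
    xorl    %eax, %eax          # total = 0
    movq    %rsi, %rcx
    shrq    $2, %rcx            # rcx = n / 4 unrolled iterations; shr sets ZF from the result
    jz      2f                  # fewer than 4 elements: go straight to the cleanup loop
1:
    addl    (%rdi), %eax        # four elements per trip through the loop
    addl    4(%rdi), %eax
    addl    8(%rdi), %eax
    addl    12(%rdi), %eax
    addq    $16, %rdi
    decq    %rcx                # the counter depends only on itself, not on the loads
    jnz     1b
2:
    andl    $3, %esi            # leftover = n % 4
    jz      4f
3:
    addl    (%rdi), %eax        # cleanup: one element at a time
    addq    $4, %rdi
    decl    %esi
    jnz     3b
4:
    ret
```

The counter chain (`decq %rcx`) never waits on loaded data, so it can run ahead of the adds, and the loop overhead is paid once per four elements instead of once per element.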

I don't have benchmarks, but I don't think the small downside on increasingly-rare CPUs justifies missing out on the front-end throughput benefit (decode and issue) for CPUs that do fusion. Total uop throughput can often be a bottleneck.

AMD Bulldozer/Piledriver/Steamroller can fuse test/cmp with any jcc, but only test/cmp, not any other ALU instructions. So definitely put compares right before their branches. It's still valuable for Intel CPUs to put other flag-setting instructions right before branches if they can macro-fuse on Sandybridge-family.
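One way to get a `cmp`/`jcc` pair at the bottom of the loop is to compare the pointer against an end pointer instead of counting down. This is only a sketch: the function name `sum_u32_endptr`, the System V argument registers and the summing loop body are assumptions for illustration.

```asm
    .text
    .globl  sum_u32_endptr      # hypothetical name, for illustration only
# uint32_t sum_u32_endptr(const uint32_t *p, uint64_t n)   (assumed System V ABI)
sum_u32_endptr:
    xorl    %eax, %eax          # total = 0
    leaq    (%rdi,%rsi,4), %rdx # end = p + n (one past the last element)
    cmpq    %rdx, %rdi
    je      2f                  # empty array
1:
    addl    (%rdi), %eax        # placeholder loop body: total += *p
    addq    $4, %rdi
    cmpq    %rdx, %rdi          # compare placed directly before its branch
    jne     1b
2:
    ret
```

This also drops the separate counter register, at the cost of computing the end pointer once before the loop.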

From Agner Fog's microarch guide, Table 9.2 (for Sandybridge / Ivybridge):

First             | can pair with these    | cannot pair with
instruction       | (and the inverse)      |
-----------------------------------------------------------------------
cmp               | jz, jc, jb, ja, jl, jg | js, jp, jo
add, sub          | jz, jc, jb, ja, jl, jg | js, jp, jo
adc, sbb          | none                   |
inc, dec          | jz, jl, jg             | jc, jb, ja, js, jp, jo
test              | all                    |
and               | all                    |
or, xor, not, neg | none                   |
shift, rotate     | none                   |

Table 9.2. Instruction fusion

So basically, inc/dec can macro-fuse with a jcc as long as the condition only depends on bits that are modified by inc/dec.

(Otherwise, they don't macro-fuse, and you get an extra uop inserted to merge the flags (like when you read eax after writing al). Or on earlier CPUs, a partial-flags stall.)
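To illustrate the carry-flag case: when the loop condition is a carry/borrow test, `sub $1` is the instruction that writes CF, whereas `dec` leaves CF unchanged. The sketch below counts an index down through zero and exits on the borrow. The function name `sum_u32_down`, the System V argument registers and the summing loop body are assumptions for illustration only.

```asm
    .text
    .globl  sum_u32_down        # hypothetical name, for illustration only
# uint32_t sum_u32_down(const uint32_t *p, uint64_t n)   (assumed System V ABI)
sum_u32_down:
    xorl    %eax, %eax          # total = 0
    subq    $1, %rsi            # rsi = n - 1; borrow (CF = 1) means n was 0
    jc      2f
1:
    addl    (%rdi,%rsi,4), %eax # total += p[i], walking from the last element down
    subq    $1, %rsi            # sub writes CF, so the branch below sees a fresh carry
    jnc     1b                  # keep looping until the index wraps below zero
2:
    ret
```

With `decq %rsi` in place of the second `subq`, the `jnc` would read the CF left behind by the `addl` in the loop body, so the loop would be incorrect as well as ineligible for fusion according to Table 9.2.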

Core2 / Nehalem were more limited in macro-fusion capability (just CMP/TEST, with more limited JCC combinations), and Core2 couldn't macro-fuse in 64-bit mode at all.

Read Agner Fog's optimizing asm and C guides, too, if you haven't already. They're full of essential knowledge.

Thanks a lot Peter, I already had the "instruction tables" and "optimize assembly" from him. I didn't read the latter completely though (hence my ignorance), BUT I WILL do it now. Thanks Peter :) — Kroma, Aug 03 '15 at 12:32
@Kroma: This table is from the microarchitecture.pdf. I forget if he mentions macro-fusion in the optimize asm guide, but probably at least mentions it. — Peter Cordes, Aug 03 '15 at 18:01
Probably worth mentioning that the first instruction and second instruction have to be in the same 16-byte decoding segment for this to work (have the same address rounded down to 16 bytes). I think Agner Fog mentions that somewhere. — Noah, Jan 14 '21 at 05:46
@Noah: first, decode groups aren't always aligned. Second, Sandybridge-family hangs on to the last instruction in a group if it's a fusion candidate, in case the first instruction in the next group is a branch. (So it sacrifices some legacy-decode throughput to maybe build more compact uop-cache lines, and to maybe minimize ROB space and other back-end resource consumption this time through). I think this has been discussed on SO somewhere, but nothing specific comes to mind. Still, you might find something with google. — Peter Cordes, Jan 14 '21 at 08:52
