
I'd like to write a load benchmark that strides across a given region of memory with a compile-time-known stride, wrapping at the end of the region (whose size is a power of 2), with as few non-load instructions as possible.

For example, given a stride of 4099, the iteration count in `rdi`, and the pointer to the memory region in `rsi`, something that "works" is:

%define STRIDE 4099
%define SIZE   128 * 1024
%define MASK   (SIZE - 1)
xor     ecx, ecx

.top:
mov     al, [rsi + rcx]
add     ecx, STRIDE
and     ecx, MASK
dec     rdi
jnz     .top

The problem is that there are 4 non-load instructions just to support the single load, handling the stride addition, the masking, and the loop termination check. There is also a 2-cycle loop-carried dependency chain through ecx.

We can unroll this a bit to reduce the loop termination check cost to near zero, and break up the dependency chain (here unrolled by 4x):

.top:

lea     edx, [ecx + STRIDE * 0]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 1]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 2]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

lea     edx, [ecx + STRIDE * 3]
and     edx, MASK
movzx   eax, BYTE [rsi + rdx]

add     ecx, STRIDE * 4

dec     rdi
jnz     .top

However, this doesn't help with the add and and operations that handle wrapping the stride. For example, the above benchmark reports 0.875 cycles/load for an L1-contained region, but we know the right answer is 0.5 (two loads per cycle). The 0.875 comes from the 15 instructions per iteration, which issue as 14 fused-domain uops (dec/jnz macro-fuse) at 4 uops per cycle, i.e., 3.5 cycles for 4 loads: we are bound by the 4-wide front end, not by load throughput.

Any ideas on a good way to unroll the loop so as to remove most of the cost of the stride calculation?

BeeOnRope
  • I imagine the CPU will end up scheduling these instructions in parallel with the loads. – Oliver Charlesworth Dec 24 '17 at 00:22
  • @OliverCharlesworth - sometimes yes, but it can't hide the latency of the calculations in the original example: the loads could sustain 1 per cycle, but the dependency chain through `ecx` takes 2 cycles, so it's actually the arithmetic that dominates the loop when the region fits in L1 or L2. The unrolled version is better, but I'd still like fewer instructions (in particular, not all archs are wide enough to schedule all the other instructions in parallel). – BeeOnRope Dec 24 '17 at 00:24
  • I'm not sure if you're going to be stymied by latency rather than throughput - due to the WAW hazards from writing to `al` each time. – Oliver Charlesworth Dec 24 '17 at 00:43
  • For certain strides you may loop until `and ecx,MASK` is zero, and use the iteration counter after that (if this is benchmark-like and the stride vs. mask hits an interesting pattern of repetition). And for a compile-time stride you can unroll the loop by hand until the first mask is needed (doing only `add` before that) ... then this unrolled loop, with one more `and` after the previous `add`, will by my very quick and inaccurate guesstimate be good enough to loop over and never leave the designated area, but a math proof / verification is needed. – Ped7g Dec 24 '17 at 00:48
  • @OliverCharlesworth, you are right, I should use `movzx` since otherwise I'll never get more than 1 load per cycle! As mentioned above though, the first (non-unrolled) loop is limited to 1 load per 2 cycles due to the `add`, `and` chain involving ecx. – BeeOnRope Dec 24 '17 at 00:48
  • Oh, to clarify, what I meant was that the CPU will presumably stall a write to a register until a previous write to that register has completed. So then you'd be rate-limited by L2 latency, not L2 throughput. And IIRC, L2 latency (even for a hit) is more than 1 cycle. – Oliver Charlesworth Dec 24 '17 at 00:53
  • @OliverCharlesworth - well you were right in a roundabout way :). On a modern OoO CPU there is no WAW hazard as you suggest: each instance of `eax` that appears in the source will generally be renamed to a distinct _physical register_ (of which there are ~200 each of GP and SSE/AVX on modern Intel), so internally the operations can proceed independently. This is easy to show, for example, with code that overwrites the same register 4 times in one cycle. However, in the special case of x86 byte regs like `al` this can't occur because the high bits aren't zeroed, so merging occurs. – BeeOnRope Dec 24 '17 at 01:05
  • @Ped7g - that's a really good point. In fact I'm choosing `STRIDE` to always be a largeish prime, so it is always relatively prime to the iteration count, so I could loop until `ecx` becomes zero (I would need to choose my unroll count carefully so that it actually occurs at the end of an iteration; I think it's enough that the unroll factor is a divisor of the iteration count). The awkward part is that it made the code simpler to have an explicit iteration count that was exactly hit (with no "tail handling" loop), but I could perhaps accommodate this strategy. – BeeOnRope Dec 24 '17 at 01:17
  • @OliverCharlesworth: On Skylake (and all non-Intel CPUs), [`mov al, [mem]` merges with the old value](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to), so microarchitecturally it's a true dependency. On Sandybridge and P6-family (like Core2/Nehalem), low8 partial registers are renamed separately from the full register, so writing `al` really is a write-only operation with no false dependency. [Register renaming + OoO execution](https://stackoverflow.com/a/45114487/224132) hides the WAW hazard as always. – Peter Cordes Dec 24 '17 at 01:27
  • @OliverCharlesworth: On Skylake at least, `mov al, [mem]` decodes as an ALU merge uop micro-fused with the load. After they both issue into the out-of-order scheduler, the load can begin execution before the input dependencies of the ALU merge are ready, so no, L2 latency isn't directly the problem. The load can start before the old value of `eax` is ready, exactly like with `add eax, [mem]` (which also only adds 1 cycle to the dependency chain involving `eax`, when the load address is ready soon enough.) – Peter Cordes Dec 24 '17 at 01:34
  • @BeeOnRope: BTW, `cmp` with an end_pointer saves one physical register vs. `dec` of a separate loop counter. Even better is if you can count an index up towards / crossing zero, but then you need indexed addressing modes. (Or `[disp32 + base]` absolute addressing). – Peter Cordes Dec 24 '17 at 01:40
  • @PeterCordes, yes thanks. In general I'm trying to keep my benchmarks simple always with the same loop structure (and anyways unrolling reduces the loop structure overhead to zero, asymptotically). In this example, there is no "end pointer" per se: the pointer jumps around in a pseudo-random way (like an LCG with a multiplier of 1), it is not striding forward to a specific end point. The important part is to solve the cost of the increment & masking, which the simple unrolling doesn't help. – BeeOnRope Dec 24 '17 at 01:43
  • @BeeOnRope - Ah, register renaming. Things are more complex than I remember :/ – Oliver Charlesworth Dec 24 '17 at 12:05

3 Answers


For "absolute maximum insanity"; you can ask the OS to map the same pages at many virtual addresses (e.g. so the same 16 MiB of RAM appears at virtual addresses 0x100000000, 0x11000000, 0x12000000, 0x13000000, ...) to avoid the need to care about wrapping; and you can use self-generating code to avoid everything else. Basically, code that generates instructions that look like:

    movzx eax, BYTE [address1]
    movzx ebx, BYTE [address2]
    movzx ecx, BYTE [address3]
    movzx edx, BYTE [address4]
    movzx esi, BYTE [address5]
    movzx edi, BYTE [address6]
    movzx ebp, BYTE [address7]
    movzx eax, BYTE [address8]
    movzx ebx, BYTE [address9]
    movzx ecx, BYTE [address10]
    movzx edx, BYTE [address11]
    ...
    movzx edx, BYTE [address998]
    movzx esi, BYTE [address999]
    ret

Of course all of the addresses used would be calculated by the code that generates the instructions.
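
The mapping part could look something like the following. This is a minimal Linux x86-64 sketch of my own, not code from the answer: it backs a 16 MiB region with a memfd and maps it at several consecutive addresses; BASE, N_MIRRORS and the memfd name are illustrative choices, and all error handling is omitted.

    ; Hypothetical setup for the mirrored mapping. Clobbers only caller-saved
    ; registers (rbx/r12 are saved); returns the base pointer in rax.
    %define REGION_SIZE (16 * 1024 * 1024)
    %define N_MIRRORS   4
    %define BASE        0x10000000

    section .rodata
    memfd_name: db "loadbench", 0

    section .text
    setup_mirrors:
        push    rbx
        push    r12

        mov     eax, 319                ; memfd_create("loadbench", 0)
        lea     rdi, [rel memfd_name]
        xor     esi, esi
        syscall
        mov     r12d, eax               ; keep the fd

        mov     eax, 77                 ; ftruncate(fd, REGION_SIZE)
        mov     edi, r12d
        mov     esi, REGION_SIZE
        syscall

        xor     ebx, ebx                ; mirror index i = 0
    .map_next:
        mov     eax, 9                  ; mmap(BASE + i*REGION_SIZE, REGION_SIZE,
        imul    rdi, rbx, REGION_SIZE   ;      PROT_READ|PROT_WRITE,
        add     rdi, BASE               ;      MAP_SHARED|MAP_FIXED, fd, 0)
        mov     esi, REGION_SIZE
        mov     edx, 3                  ; PROT_READ | PROT_WRITE
        mov     r10d, 0x11              ; MAP_SHARED | MAP_FIXED
        mov     r8d, r12d
        xor     r9d, r9d
        syscall
        inc     ebx
        cmp     ebx, N_MIRRORS
        jb      .map_next

        mov     eax, BASE               ; first copy of the region
        pop     r12
        pop     rbx
        ret

Because every mirror is a MAP_SHARED mapping of the same memfd, the mirrors are guaranteed to alias the same physical pages; whether the caches handle those aliases without penalty is exactly what the question linked in the comments below asks about.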

Note: Depending on which specific CPU it is, it may be faster to have a loop rather than completely unrolling (there's a compromise between instruction fetch and decode costs and loop overhead). For more recent Intel CPUs there's a thing called the loop stream detector (LSD), designed to avoid fetch and decode for loops smaller than a certain size (where that size depends on the CPU model); I'd assume generating a loop that fits within that size would be optimal.

Brendan
  • Yeah, that same idea occurred to me (you actually only need a bit of duplicate mapping, then you can combine that with the approach Peter mentions of unrolling and over-running the buffer slightly: the overrun is caught by the extra duplicate pages you mapped). It isn't clear to me that this is performance neutral, however (e.g., some caches may only be able to keep one virtual mapping for any line in the L1 at once), and so I asked [about it here](https://stackoverflow.com/q/47957559/149138). – BeeOnRope Dec 24 '17 at 03:51
  • Note that Skylake's loop buffer (LSD) is disabled by the microcode update that fixed the rare partial-register merging bug. So effectively [SKL, SKX, and KBL don't have an LSD, just the uop cache. Wikichip says it's re-enabled in Coffee Lake](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake). It's very likely worth making a loop that fits in the uop cache; although `movzx` with an absolute or RIP-relative addressing mode is less than 8 bytes, so the front-end length pre-decode + decode bottleneck is probably exactly 2 IPC, matching max load throughput. – Peter Cordes Dec 24 '17 at 06:54
  • And BTW, it's not always a win (for non-benchmarking purposes) to make loops small enough to run from the LSD, when they will run so many iterations that unrolling is worth the overall code-size cost. The uop cache is quite good, especially in SKL and later. But 28 uops is often plenty for small loops. (That's the smallest on any CPU that has a LSD. SKL's is 64 uops, with or without HT enabled.) – Peter Cordes Dec 24 '17 at 06:57
  • @PeterCordes: There's something oddly pleasing about LSD causing bizarre ("hallucinogenic") partial register effects... ;-) – Brendan Dec 24 '17 at 07:26

About that math proof ... at the beginning of the unrolled loop, if ecx < STRIDE, n = (SIZE div STRIDE), and SIZE is not divisible by STRIDE, then (n-1)*STRIDE < SIZE, i.e. n-1 iterations are safe without masking. The n-th iteration may or may not need masking (it depends on the initial ecx); if the n-th iteration did not need the mask, the (n+1)-th will need it.

The consequence is that you can design code like this:

    xor    ecx, ecx
    jmp    loop_entry
unrolled_loop:
    and    ecx, MASK     ; make ecx < STRIDE again
    jz     terminate_loop
loop_entry:
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    ; ... the movzx/add pair repeated, (SIZE div STRIDE)-1 times in total ...
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE

    ;after (SIZE div STRIDE)-th add ecx,STRIDE
    cmp    ecx, SIZE
    jae    unrolled_loop
    movzx  eax, BYTE [rsi+rcx]
    add    ecx, STRIDE
    ; assert( ecx >= SIZE )
    jmp    unrolled_loop

terminate_loop:

The number of `add`s that happen before an `and` is needed is not regular; it will be n or n+1, so the end of the unrolled loop has to be doubled, so that each pass of the unrolled loop starts with an ecx < STRIDE value.

I'm not good enough with NASM macros to decide whether this can be generated by some kind of macro magic.
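
For what it's worth, here is an untested sketch (my own) of how the body above could be generated with NASM's `%rep`, reusing the question's SIZE / STRIDE / MASK defines:

; Let the preprocessor emit the (SIZE / STRIDE) safe movzx/add pairs.
%define N_SAFE (SIZE / STRIDE)      ; 31 for SIZE = 128*1024, STRIDE = 4099

    xor    ecx, ecx
    jmp    loop_entry
unrolled_loop:
    and    ecx, MASK                ; make ecx < STRIDE again
    jz     terminate_loop
loop_entry:
%rep N_SAFE
    movzx  eax, BYTE [rsi + rcx]
    add    ecx, STRIDE
%endrep
    cmp    ecx, SIZE                ; after the N_SAFE-th add
    jae    unrolled_loop
    movzx  eax, BYTE [rsi + rcx]    ; one extra pair if we haven't crossed SIZE yet
    add    ecx, STRIDE
    jmp    unrolled_loop
terminate_loop: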

Also there's a question of whether this can be macro-ed to use different registers, like

    xor    ecx, ecx

    ...
loop_entry:
    lea    rdx,[rcx + STRIDE*4]  
    movzx  eax, BYTE [rsi + rcx]
    movzx  eax, BYTE [rsi + rcx + STRIDE]
    movzx  eax, BYTE [rsi + rcx + STRIDE*2]
    movzx  eax, BYTE [rsi + rcx + STRIDE*3]
    add    ecx, STRIDE*8
    movzx  eax, BYTE [rsi + rdx]
    movzx  eax, BYTE [rsi + rdx + STRIDE]
    movzx  eax, BYTE [rsi + rdx + STRIDE*2]
    movzx  eax, BYTE [rsi + rdx + STRIDE*3]
    add    edx, STRIDE*8
    ...

    ; then the final part can be filled with simple pairs:
    movzx  eax, BYTE [rsi + rcx]
    add    ecx, STRIDE
    ; ... until the n-th ADD state is reached, then the jae / final part follows:

    ;after (SIZE div STRIDE)-th add ecx,STRIDE
    cmp    ecx, SIZE
    jae    unrolled_loop
    movzx  eax, BYTE [rsi + rcx]
    add    ecx, STRIDE
    ; assert( ecx >= SIZE )
    jmp    unrolled_loop

Also, the inner "safe" part can be looped some number of times: if SIZE div STRIDE = 31.97657965357404 as in your example, then the inner 8-times-movzx block can be looped 3 times (3*8 = 24), then 7 of the simple non-`and` pairs bring you to the 31st add, and then the doubled loop exit follows, eventually reaching the 32nd add as needed.

Although in the case of your ~31.9 this looks pointless; looping over the middle part would make sense when SIZE div STRIDE is in the hundreds or more.

Ped7g

If you use AVX2 gather to generate the load uops, you can use SIMD for the add + AND. This probably isn't what you want when trying to measure anything about non-gather loads, though!

If your region size is 2^16..2^19 bytes, you can use add ax, dx (with DX = stride, to avoid LCP stalls) to get wrapping at 2^16 for free, using eax as a scaled index. With lea di, [eax + STRIDE * n] and so on in an unrolled loop, this could save enough uops to let you run 2 loads per clock without bottlenecking on the front-end. But the partial-register merging dependencies (on Skylake) would create multiple loop-carried dep chains, and you'd run out of registers in 32-bit mode if you need to avoid reusing them.
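
A minimal sketch of the 2^16-byte (scale 1) case, with my own register assignment rather than anything from the answer; the upper bits of the index registers are zeroed once outside the loop, and the 16-bit writes inside it leave them untouched:

        xor     ecx, ecx                ; running index; only cx is written in the loop
        xor     edx, edx                ; scratch indices; 16-bit writes leave the
        xor     ebx, ebx                ; upper bits zeroed here untouched
        xor     ebp, ebp                ; (rbp used as a plain scratch register)
        mov     r8d, STRIDE * 4         ; increment in a register: no imm16, no LCP stall
    .top:
        lea     dx, [rcx + STRIDE * 1]  ; 16-bit destinations: wrap mod 2^16 for free
        lea     bx, [rcx + STRIDE * 2]
        lea     bp, [rcx + STRIDE * 3]
        movzx   eax, BYTE [rsi + rcx]
        movzx   eax, BYTE [rsi + rdx]
        movzx   eax, BYTE [rsi + rbx]
        movzx   eax, BYTE [rsi + rbp]
        add     cx, r8w                 ; running index also wraps mod 2^16
        dec     rdi
        jnz     .top

The 16-bit writes still read the old full register on Skylake (the merging the answer warns about), but each of those registers is written only once per iteration, so those chains stay at one cycle per iteration. This is 9 fused-domain uops per 4 loads; a larger unroll factor brings it toward 2 uops per load, which is what a 4-wide front end needs to sustain 2 loads per clock.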

You could even consider mapping the low 64k of virtual memory (on Linux set vm.mmap_min_addr=0) and using 16-bit addressing modes in 32-bit code. Reading only 16-bit registers avoids complications of having to only write 16 bits; it's fine to end up with garbage in the upper 16.


To do better without 16-bit addressing modes, you need to create conditions where you know wrapping can't happen. This allows unrolling with [reg + STRIDE * n] addressing modes.

You could write a normal unrolled loop that breaks out when approaching the wrap-around point (i.e. when ecx + STRIDE*n > bufsize), but that won't predict well if bufsize / STRIDE is greater than about 22 on Skylake.

You could simply do the AND masking only once per iteration, and relax the constraint that the working set is exactly 2^n bytes. i.e. if you reserve enough space for your loads to go beyond the end by up to STRIDE * n - 1, and you're ok with that cache behaviour, then just do it.
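
For example (my own sketch of that suggestion, assuming the buffer is over-allocated by at least STRIDE * 3 bytes beyond SIZE):

        xor     ecx, ecx
    .top:
        movzx   eax, BYTE [rsi + rcx]               ; ecx < SIZE here...
        movzx   eax, BYTE [rsi + rcx + STRIDE * 1]
        movzx   eax, BYTE [rsi + rcx + STRIDE * 2]
        movzx   eax, BYTE [rsi + rcx + STRIDE * 3]  ; ...so this reads at most STRIDE*3 past it
        add     ecx, STRIDE * 4
        and     ecx, MASK                           ; wrap only once per 4 loads
        dec     rdi
        jnz     .top

That's 7 fused-domain uops per 4 loads (with dec/jnz macro-fused), so the front end needs only about 1.75 cycles per iteration while the loads need 2, and the add/and chain is also 2 cycles per iteration, so the 0.5 cycles/load target is within reach.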

If you choose your unroll factor carefully, you can maybe control where the wraparound will happen every time. But with a prime stride and a power of 2 buffer, I think you'd need an unroll of lcm(stride, bufsize/stride) = stride * bufsize/stride = bufsize for the pattern to repeat. For buffer sizes that don't fit in L1, this unroll factor is too large to fit in the uop cache, or even L1I. I looked at a couple small test cases, like n*7 % 16, which repeats after 16 iterations, just like n*5 % 16 and n*3 % 16. And n*7 % 32 repeats after 32 iterations. i.e. the linear congruential generator explores every value less than the modulus when the multiplier and modulus are relatively prime.

None of these options are ideal, but that's the best I can suggest for now.

Peter Cordes