I'd like to write a load benchmark that strides across a given region of memory with a compile-time known stride, wrapping at the end of the region (power-of-2) with as few non-load instructions as possible.
For example given a stride of 4099, and the iteration count in rdi
and the pointer to the memory region in rsi
something that "works" is:
%define STRIDE 4099
%define SIZE 128 * 1024
%define MASK (SIZE - 1)
xor ecx, ecx
.top:
mov al, [rsi + rcx]
add ecx, STRIDE
and ecx, MASK
dec rdi
jnz .top
The problem is that there are 4 non-load instructions just to support the single load, dealing with the stride addition, masking and loop termination check. Also, there is a 2-cycle dependency chain carried through ecx
.
We can unroll this a bit to reduce the loop termination check cost to near zero, and break up the dependency chain (here unrolled by 4x):
.top:
lea edx, [ecx + STRIDE * 0]
and edx, MASK
movzx eax, BYTE [rsi + rdx]
lea edx, [ecx + STRIDE * 1]
and edx, MASK
movzx eax, BYTE [rsi + rdx]
lea edx, [ecx + STRIDE * 2]
and edx, MASK
movzx eax, BYTE [rsi + rdx]
lea edx, [ecx + STRIDE * 3]
and edx, MASK
movzx eax, BYTE [rsi + rdx]
add ecx, STRIDE * 4
dec rdi
jnz .top
However, this it doesn't help for the add
and and
operations dealing with the wrapping the stride. For example, the above benmark reports 0.875 cycles/load for an L1-contained region, but we know the right answer is 0.5 (two loads per cycle). The 0.875 comes from 15 total total uops / 4 uops cycle, i.e., we are bound by the 4-wide maximum width of uop throughput, not on load throughput.
Any idea on how to a good way to effectively unroll the loop to remove most of the cost of the stride calculation?