I have a loop in my code where I spend most of the CPU time:

%%write_to_intermediate_head:
    ; uint64_t count_index = (arr[i] >> (8 * cur_digit)) & (RADIX - 1);
    ; We will use rsi to hold the digit we're currently examining
    mov rsi, [rdx + rdi * 8] ; Load arr[i] into rsi
    mov r9, %1 ; Load current digit we're sorting on
    shl r9, 3  ; Multiply by 8
    shrx rsi, rsi, r9
    and rsi, 255

    ; intermediate_arr[--digit_counts[count_index]] = arr[i];
    ; rdi is the loop counter i
    ; rsi is the count_index
    ; rcx is the digit_counts array
    ; rax is the intermediate_arr
    dec qword [rcx + rsi * 8]
    mov r9, [rcx + rsi * 8] ; --digit_counts[count_index]

    mov rsi, [rdx + rdi * 8] ; arr[i]
    mov [rax + r9 * 8], rsi

    dec rdi
    jnz %%write_to_intermediate_head

The variables digit_counts, arr, and intermediate_arr are all in memory (heap and BSS). The AMD profiler shows that many cycles are spent reading and writing these memory locations. Is there any way to optimize this?

asked by lvrf
2 Answers


Do your counts truly need to be qwords, or could you use a narrower type to cut that array's cache footprint in half with 32-bit counts (or even smaller with a narrower type)? If you're getting cache misses, that's going to mean much more time spent waiting for loads/stores if OoO exec can't hide that latency.
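
For example, a minimal sketch of what the count access would look like with 32-bit counts, assuming digit_counts becomes an array of dwords (keeping your current structure otherwise; the scale factor changes from 8 to 4):

    dec  dword [rcx + rsi * 4]  ; digit_counts is now an array of 32-bit counts
    mov  r9d, [rcx + rsi * 4]   ; 32-bit load zero-extends into r9, still usable as an index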

I guess copying the data around is going to account for most of the memory bandwidth / cache misses, though. This looks like Radix Sort, and the amount of metadata to manage is smallish compared to the data. (But having at least that metadata hit in cache can help, making it more resistant to eviction by all the other data you're throwing around.)


No matter what you do, the access pattern of Radix Sort is inherently not very cache-friendly, although it's not terrible. You're scattering writes across 256 possible destinations, along with updating the pointers. But those 256 destinations are sequential streams so they can hit in L1d cache if you're lucky.

Hopefully those destinations aren't multiples of 4k apart (initially or most of the time), otherwise they'll alias to the same set in L1d cache and cause conflict misses (i.e. force eviction of another partially-written cache line that you're soon going to write to).


You have some redundant loads / stores which may be a bottleneck for load/store execution units, but if that's not the bottleneck then cache will absorb them just fine. This section is mostly about tuning the loop to use fewer uops, improving things in the no-cache-miss best case, and giving OoO exec less latency to hide.

Using a memory-destination dec and then reloading the dec's store seems obviously inefficient in terms of total back-end load/store operations, and latency for OoO exec to hide. (Although on AMD, dec mem is still a single uop for the front-end, vs. 3 on Intel; https://uops.info/ and https://agner.org/optimize/).
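
For example, a minimal sketch of the load / dec / store version, keeping the decremented count in a register (R10 here is just an assumed-free register) so the later store into intermediate_arr doesn't have to reload it:

    mov  r10, [rcx + rsi * 8]   ; load digit_counts[count_index]
    dec  r10                    ; --digit_counts[count_index]
    mov  [rcx + rsi * 8], r10   ; store the updated count back
    ; r10 still holds the new count, ready to index the store into intermediate_arr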

Similarly, aren't you loading [rdx + rdi * 8] ; arr[i] twice, with the same RDI? SHRX can copy-and-shift into a different register, so keeping that load result around for the later store doesn't even cost you extra uops. (You could also use a simple non-indexed addressing mode for arr[i], by doing a pointer-increment like add rdi,8 and cmp rdi, endp/jne, where endp is an end pointer you calculated earlier with lea endp, [rdx + size*8]. Looping forward over an array can be better for some HW prefetchers.)
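
A sketch of that forward, pointer-increment loop structure, with assumed register choices (R10 as the end pointer, R8 as the element count); note that if your sort relies on walking backward with pre-decremented counts for stability, flipping the direction also means switching to start offsets with post-increment:

    lea  r10, [rdx + r8 * 8]    ; r10 = one-past-the-end of arr (size in r8, assumed)
    mov  rdi, rdx               ; rdi = pointer walking forward through arr
%%forward_head:
    mov  rsi, [rdi]             ; arr[i] with a simple non-indexed addressing mode
    ; ... extract digit, update count, store to intermediate_arr ...
    add  rdi, 8
    cmp  rdi, r10
    jne  %%forward_head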

x86-64 has 15 registers, so if you need more for this inner loop, save/restore some call-preserved registers (like RBX or RBP) at the top/bottom of the function. Or spill some outer-loop stuff to memory if necessary.

mov r9, %1 and the following shl r9, 3 are loop-invariant, so hoist that calculation out of the loop, and keep the result in a register you don't overwrite inside the inner loop.


You do need to zero-extend the byte you extracted, but and rsi, 255 is not as efficient as movzx eax, sil. (Or better, pick a register like ECX whose low byte can be accessed without a REX prefix.) AMD can't do mov-elimination on movzx the way Intel can, though, so on AMD this only saves code size, but it also saves latency if you ever run this on Intel CPUs.

Or better, AMD Zen has single-uop BMI1 bextr r64,r64,r64, so prepare a start/length pair in the low 2 bytes of a register. As discussed, that's loop-invariant, i.e. before the loop: mov ecx, %k1 / shl cl, 3 / mov ch, 0x8. (AMD doesn't have partial-register stalls, just false dependencies; here the dependency is a true one anyway, since we want to merge into the same register.) If that's GNU C inline-asm syntax, %k1 specifies the 32-bit version of the register. Or if %1 is a memory operand, you're loading it anyway, and hoisting that load out of the loop saves another load per iteration!

(Intel has 2-uop bextr, presumably shift and bzhi uops.)
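
A sketch of that hoisted BEXTR control setup, using RBX instead of RCX since RCX already holds digit_counts in your loop (RBX is call-preserved, so save/restore it as described above; the register choice is just an assumption):

    ; before the loop (loop-invariant), assuming %1 is the digit number
    mov  ebx, %1                ; digit we're currently sorting on
    shl  bl, 3                  ; BL = start bit = digit * 8
    mov  bh, 8                  ; BH = field length = 8 bits

    ; inside the loop, with arr[i] already loaded into rsi
    bextr r9, rsi, rbx          ; r9 = (arr[i] >> (8*digit)) & 0xFF, single uop on Zen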

Or if you really want to load twice, movzx esi, byte [rdi + rdx] where RDI is a pointer to arr[i] that you increment or decrement, and RDX is a byte offset (the digit number). But probably BEXTR is a better bet.
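
Putting the pieces above together, a sketch of what the inner loop might look like (register choices beyond those in your original code are assumptions: R10 is assumed free, and RBX holds the hoisted BEXTR control as set up above):

    push rbx                     ; RBX is call-preserved
    mov  ebx, %1                 ; assuming %1 is the digit number
    shl  bl, 3                   ; start bit = digit * 8
    mov  bh, 8                   ; field length = 8 bits
%%write_to_intermediate_head:
    mov   rsi, [rdx + rdi * 8]   ; load arr[i] once, keep it for the store below
    bextr r9, rsi, rbx           ; count_index = (arr[i] >> (8*digit)) & 0xFF

    mov   r10, [rcx + r9 * 8]    ; load digit_counts[count_index]
    dec   r10
    mov   [rcx + r9 * 8], r10    ; --digit_counts[count_index]

    mov   [rax + r10 * 8], rsi   ; intermediate_arr[new count] = arr[i]

    dec   rdi
    jnz   %%write_to_intermediate_head
    pop  rbx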

– Peter Cordes
  • Thank you so much for your response, this is so detailed and complete. First of all, this is indeed radix sort (full source: https://github.com/vfrazao-ns1/radix_sort/blob/main/radix.s). I've hoisted the operations on r9 outside of the loop and another idea came to mind: could I just do `lea r9, [%1 * 8]` instead of doing the mov followed by the shl? I guess if it's not in the loop it doesn't really matter. I think the digit counts *may* need to be 64 bits wide, since an array I'm sorting may have up to SIZE_MAX elements and they could all be the same (theoretically at least). – lvrf Jun 17 '21 at 17:48
    @lvrf: you could maybe check for wrapping in your digit counts and special case that. Like include a `jnz` after a `dec`, or a `jnc` if you use byte offsets instead of counts; I'm not sure if AMD cares about the scale factor being 1 or not in addressing modes for memory access (not LEA). But use perf record or whatever to see if those accesses cache miss. e.g. Intel has a perf event `mem_load_retired.l1_miss` that maps to specific instructions; AMD probably has something similar. – Peter Cordes Jun 17 '21 at 17:54
    @lvrf: yes, LEA can replace mov/shl. It costs a couple bytes of extra code size, but does save a uop. (`[reg * 8]` has to be encoded as `[reg*8 + disp32=0]`, since the no-base-reg encoding is how ModRM/SIB signals that there's just a displacement). A scaled-index LEA may actually have worse latency on AMD, but outside the loop that's fine. Of course, if you know it's a small integer (high bits aren't set), `rorx eax, %k1, 32-3` will also copy-and-shift. (No need for 64-bit operand-size, other than in an LEA addressing mode: x86-64's defaults without prefix bytes are op size = 32, addr size = 64.) – Peter Cordes Jun 17 '21 at 17:58
    I've removed the redundant load of `arr[i]` into rsi (I was somehow blind to that) and I'm now (on avg) beating the -O2-optimized C version I wrote. To confirm: `dec qword [rcx + rsi * 8]` / `mov r10, [rcx + rsi * 8]` is **slower** than something like `mov r10, [rcx + r11 * 8]` / `dec r10` / `mov [rcx + r11 * 8], r10` (load, dec, store). – lvrf Jun 17 '21 at 18:03
  • Thanks again for all your help. I feel like I have the basics of x86 down but want to move on to some more intermediate learning materials - do you have any book or resource recommendations? – lvrf Jun 17 '21 at 18:04
    @lvrf: Another idea for getting away with 32-bit counts: maybe have two versions of your loops, and if the total input array size to be sorted is less than 2^32, then you can use the 32-bit counter version. Or if any given bucket to be sorted further is small enough, then it can use 32-bit counters. – Peter Cordes Jun 17 '21 at 18:08
    @lvrf: resources: Agner Fog's optimization guide (and especially microarchitecture guide) is essential reading. https://agner.org/optimize/. See also http://www.lighterra.com/papers/modernmicroprocessors/ re pipelines and OoO exec, and ROB / store buffer stuff, and https://www.realworldtech.com/sandy-bridge/ re: a specific microarchitecture. (Unfortunately David Kanter hasn't published a similar Zen deep dive, but see https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Architecture). – Peter Cordes Jun 17 '21 at 18:11
    https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ is great for understanding how OoO exec hides memory latency. See also [Are loads and stores the only instructions that gets reordered?](https://stackoverflow.com/a/50496379). Re: the store, see [Can a speculatively executed CPU branch contain opcodes that access RAM?](https://stackoverflow.com/q/64141366). Also other links in https://stackoverflow.com/tags/x86/info. Agner Fog's microarch guide is most helpful for tuning code that bottlenecks on the pipeline itself, not as much on cache misses. – Peter Cordes Jun 17 '21 at 18:12

Other optimizations: the initial pass to generate counts can be done for all digits at the same time, using a matrix instead of a single array. For 64-bit unsigned integers, doing 8 passes with 1-byte digits is close enough to ideal, as the counts / indexes will fit in the L1 cache. The initial pass stores counts in an [8][256] matrix, and 32-bit counts / indexes should be good enough.
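
A sketch of what that single counting pass could look like, in the same NASM style as the question (register assignments are assumptions; counts is assumed to be a zeroed [8][256] array of dwords):

    ; rsi = arr, rdx = element count, rdi = &counts[0][0]
%%count_head:
    mov   rax, [rsi]             ; load the next 64-bit element
    mov   r9, rdi                ; r9 = &counts[0][0]
    mov   ecx, 8                 ; 8 one-byte digits per element
%%per_digit:
    movzx r8d, al                ; current digit = low byte of the element
    inc   dword [r9 + r8 * 4]    ; ++counts[digit][byte]
    shr   rax, 8                 ; bring the next digit into AL
    add   r9, 256 * 4            ; advance to the next row of the [8][256] matrix
    dec   ecx
    jnz   %%per_digit
    add   rsi, 8                 ; next element
    dec   rdx
    jnz   %%count_head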

For an array much larger than cache, if the data to be sorted is reasonably uniform, the first radix-sort pass can be a most-significant-digit pass that creates 256 bins (using 1-byte digits), with the goal of each bin fitting in cache; then each of the 256 bins is sorted least-significant-digit first, one bin at a time. If the array is larger still, the first pass can create more bins: 512 (9-bit digit), 1024 (10-bit digit), and so on; each bin can still be sorted using 1-byte digits, with a smaller digit on the last pass.

– rcgldr