
Imagine a load-store loop like the following, which loads DWORDs from non-contiguous locations and stores them contiguously:

top:
mov eax, DWORD [rsi]
mov DWORD [rdi], eax
mov eax, DWORD [rdx]
mov DWORD [rdi + 4], eax
; unroll the above a few times
; increment rdi and rsi somehow
cmp ...
jne top

On modern Intel and AMD hardware, when running in-cache, such a loop will usually bottleneck on the stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one load).

One idea that naturally arises is to combine two DWORD loads into a single QWORD store which is possible since the stores are contiguous. Something like this could work:

top:
mov eax, DWORD [rsi]
mov ebx, DWORD [rdx]
shl rbx, 32
or  rax, rbx
mov QWORD [rdi], rax
; increment the pointers and loop as before

Basically, do the two loads and use two ALU ops to combine them into a single QWORD, which we can then store with a single store. Now we're bottlenecked on uops: 5 fused-domain uops per 2 DWORDs, so at 4 uops per cycle that's 1.25 cycles per QWORD, or 0.625 cycles per DWORD.

Already much better than the first option, but I can't help thinking there is a better option for this shuffling. For example, we are wasting uop throughput by using plain loads: it feels like we should be able to fold at least some of the ALU ops into the loads via memory source operands, but I was mostly stymied on Intel: shl on memory only has an RMW form, and shlx and rorx don't micro-fuse.
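For concreteness, here's roughly what the shlx attempt looks like (a sketch: shlx takes its count in a register rather than an immediate, and the memory form reads a full QWORD, so this assumes reading 4 bytes past the second DWORD is safe):

mov ecx, 32                 ; shlx has no immediate-count form
top:
mov  eax, DWORD [rsi]       ; low DWORD, zero-extended into rax
shlx rbx, QWORD [rdx], rcx  ; folds the load, but reads 4 bytes past the DWORD;
                            ; the garbage bytes are shifted out of the top
or   rax, rbx
mov  QWORD [rdi], rax
; increment the pointers and loop as before

If shlx-from-memory is 2 unfused uops, this is still 5 fused-domain uops per QWORD, i.e., no better than the plain version - which is exactly the frustration.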

It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the low DWORD.
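For example, with a pre-loaded mask the cleanup might look like this (a sketch, ignoring the problem of the QWORD load reading before the start of the second buffer):

mov rcx, 0xFFFFFFFF00000000  ; mask hoisted out of the loop
top:
mov eax, DWORD [rsi]         ; low DWORD, zero-extended into rax
mov rbx, QWORD [rdx-4]       ; the wanted DWORD lands in the high half
and rbx, rcx                 ; clear the garbage low DWORD
or  rax, rbx
mov QWORD [rdi], rax
; increment the pointers and loop as before

That's still 2 ALU uops per QWORD, so it saves nothing over the shl/or version while adding the read-before-the-buffer hazard.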

I'm interested in scalar code, both for the base x86-64 instruction set and, where possible, better versions using useful extensions like BMI.

  • MMX `punpckldq mm0, [mem]` micro-fuses on SnB-family (including Skylake), so `movd`-load / `punpckldq`-load / `movq`-store is usable if loading a `qword` from one of the `dword` locations is ok for correctness and performance. – Peter Cordes Nov 11 '17 at 20:24
  • `increment somehow` isn't meant to imply that you're interleaving two contiguous arrays, is it? Obviously SSE2 `punpckldq` / `punpckhdq` does that well. – Peter Cordes Nov 11 '17 at 20:26
  • @PeterCordes - well the point was to hide the details of the two arrays and avoid people using the fact that they are contiguous. In my application they aren't (it's a bit more complex than the above, but I was trying to keep the use-case simple). You could imagine that the unrolled versions have things like [rsi + 0x1234] as their source if you want (i.e., big jumps). I'm mostly interested in scalar code here. – BeeOnRope Nov 11 '17 at 20:30
  • Ok, that's what I thought. No opportunity to load data for multiple iterations with a vector and shuffle. – Peter Cordes Nov 11 '17 at 20:32
  • Using an offset `QWORD` load is OK as mentioned, but may be less ideal than solutions that don't use it due to cache-line splitting, and the need to handle the boundary case where such a read may read before the buffer. Using the `mmx` instructions is an interesting idea... – BeeOnRope Nov 11 '17 at 20:33
  • The CPU has a store buffer for this exact reason. – Jonathon Reinhart Nov 11 '17 at 20:56
  • @JonathonReinhart that doesn't help, only one store can be executed per cycle so if they are not merged *manually*, it is already too late. – harold Nov 11 '17 at 21:16

1 Answer


It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the low DWORD.

If wider loads are ok for correctness and performance (cache-line splits...), we can use shld:

top:
    mov eax, DWORD [rsi]
    mov rbx, QWORD [rdx-4]     ; unaligned(?) 64-bit load

    shld rax, rbx, 32          ; 1 uop on Intel SnB-family, 0.5c recip throughput
    mov QWORD [rdi], rax       ; note: this is ([rsi] << 32) | [rdx], i.e. the two
                               ; DWORDs swapped vs. the question's loop; swap the
                               ; rsi/rdx loads if the output order matters

MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake).

top:
    movd       mm0, DWORD [rsi]
    punpckldq  mm0, QWORD [rdx]     ; 1 micro-fused uop on Intel SnB-family

    movq       QWORD [rdi], mm0

 ; required after the loop, making it only worthwhile for long-running loops
 emms

punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.

You could also use the 128-bit AVX version, but that's even more likely to cross a cache line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov + vpunpckldq xmm1, xmm0, [cache_line-8] runs at 1 iter per 2 clocks vs. 1 iter per clock for aligned.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/
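Roughly what that 128-bit AVX version would look like (a sketch; the 16-byte load reads 12 bytes past the DWORD at [rdx], which is exactly the cache-line-split / page-cross problem described above):

top:
    vmovd       xmm0, DWORD [rsi]
    vpunpckldq  xmm0, xmm0, [rdx]    ; 16-byte unaligned load
    vmovq       QWORD [rdi], xmm0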

Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps as a replacement for memory-source punpcklqdq, but narrower widths that actually shuffle can't be replaced.
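For the QWORD case, that replacement looks like this (a sketch of interleaving two QWORDs into a 16-byte store, not directly applicable to the DWORD problem here):

    movq    xmm0, QWORD [rsi]        ; low 8 bytes
    movhps  xmm0, QWORD [rdx]        ; 8-byte load into the high half, unlike
                                     ; punpcklqdq's 16-byte memory operand
    movups  [rdi], xmm0              ; 16-byte result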


To avoid cache-line splits, you could also use a separate movd load and punpckldq, or SSE4.1 pinsrd. With this, there's no reason for MMX.

top:
    movd       xmm0, DWORD [rsi]

    movd       xmm1, DWORD [rdx]           ; SSE2
    punpckldq  xmm0, xmm1
    ; or pinsrd  xmm0, DWORD [rdx], 1      ; 2 uops not micro-fused

    movq       QWORD [rdi], xmm0

Obviously AVX2 vpgatherdd is a possibility, and may perform well on Skylake.
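A sketch of what that could look like, assuming the distance between the two sources is loop-invariant (both pointers advance together) and fits in a signed 32-bit gather index:

    ; setup, outside the loop
    mov         rax, rdx
    sub         rax, rsi               ; byte offset of the second DWORD from rsi
    vmovd       xmm1, eax
    vpslldq     xmm1, xmm1, 4          ; indices = {0, rdx-rsi, 0, 0}
top:
    vpcmpeqd    xmm2, xmm2, xmm2       ; all-ones mask; the gather zeroes it, so it
                                       ; must be re-materialized every iteration
    vpgatherdd  xmm0, [rsi + xmm1], xmm2   ; lanes 2,3 harmlessly reload [rsi]
    vmovq       QWORD [rdi], xmm0
    ; advance the pointers and loop as before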

Peter Cordes
  • Ah, I had skipped over `shld` because the variable shift-count version is so crap and that's the one I always wanted. The immediate version is good though (latency of 3, however)! Yeah it seems `punpckldq` is ruined because the sources have no particular alignment (not even `DWORD` alignment). – BeeOnRope Nov 11 '17 at 20:37
  • @BeeOnRope: MMX memory operands don't have alignment requirements; they're only QWORD. – Peter Cordes Nov 11 '17 at 20:43
  • @BeeOnRope: added an SSE2 version. Separate movd / punpckldq, but still better than 2 scalar ops on some CPUs. – Peter Cordes Nov 11 '17 at 20:54