
Once again I'm jumping back into teaching myself basic assembly language so I don't completely forget everything.

I made this practice code the other day, and in it, it turned out that I had to plop the results of a vector operation into an array backwards; otherwise it gave a wrong answer. Incidentally, this is also how GCC outputs assembly for writing SIMD operation results back to a memory location, so I presume it's the "correct" way.

However, something occurred to me, related to something I've been conscious of as a game developer for quite a long time: cache friendliness. My understanding is that moving forward through a contiguous block of memory is always ideal; otherwise you risk cache misses.

My question is: even if this example below is nothing more than calculating a couple of four-element vectors and spitting out four numbers before quitting, I have to wonder whether this -- putting numbers back into an array in what is technically reverse order -- has any impact at all on cache misses in the real world, within a typical production-level program that does hundreds of thousands of SIMD vector calculations (and, more specifically, stores of their results back to memory) per second?

Here is the full code (Linux 64-bit NASM) with comments, including the original comment that prompted me to bring this curiosity to Stack Exchange:

extern printf
extern fflush

global _start
section .data
outputText:     db '[%f, %f, %f, %f]',10,0

align 16
vec1:    dd 1.0, 2.0, 3.0, 4.0
vec2:    dd 10.0,10.0,10.0,50.0

section .bss
result:  resd 4       ; four 32-bit single-precision floats

section .text
_start:
    sub rsp,16

    movaps xmm0,[vec1]
    movaps xmm1,[vec2]

    mulps xmm0,xmm1          ; xmm0 = (vec1 * vec2)

    movaps [result],xmm0     ; copy 4 floats back to result[]

    ; printf only accepts 64-bit floats for some dumb reason,
    ; so convert these 32-bit floats packed within the 128-bit xmm0
    ; register into four 64-bit floats, each in a separate xmm* reg
    movss xmm0,[result+12]   ; result[3]
    unpcklps xmm0,xmm0       ; 32-->64 bit
    cvtps2pd xmm3,xmm0       ; put double in 4th XMM

    movss xmm0,[result+8]    ; result[2]
    unpcklps xmm0,xmm0       ; 32-->64 bit
    cvtps2pd xmm2,xmm0       ; put double in 3rd XMM

    movss xmm0,[result+4]    ; result[1]
    unpcklps xmm0,xmm0       ; 32-->64 bit
    cvtps2pd xmm1,xmm0       ; put double in 2nd XMM

    movss xmm0,[result]      ; result[0]
    unpcklps xmm0,xmm0       ; 32-->64 bit
    cvtps2pd xmm0,xmm0       ; put double in 1st XMM

    ; FOOD FOR THOUGHT!
    ; *****************
    ; That was done backwards, going from highest element 
    ; of what is technically an array down to the lowest.
    ; 
    ; This is because when it was done from lowest to
    ; highest, this garbled bird poop was the answer:
    ; [13510801139695616.000000, 20.000000, 30.000000, 200.000000]
    ;
    ; HOWEVER, if the correct way is this way, in which
    ; it traipses through an array backwards...
    ; is that not cache-unfriendly?  Or is it too tiny and
    ; miniscule to have any impact with cache misses?

    mov rdi, outputText     ; tell printf where the format string is

    mov rax,4               ; tells printf to print 4 XMM regs
    call printf

    mov rdi,0
    call fflush             ; ensure we see printf output before exit

    add rsp,16

_exit:  
    mov eax,1            ; syscall id for sys_exit
    mov ebx,0            ; exit with ret of 0 (no error)
    int 80h

1 Answer


HW prefetchers can recognize streams with descending addresses as well as ascending. Intel's optimization manual documents the HW prefetchers in fair detail. I think AMD's prefetchers are broadly similar in terms of being able to recognize descending patterns as well.

Within a single cache-line, it doesn't matter at all what order you access things in, AFAIK.
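
To make "descending stream" concrete, here's a minimal standalone sketch (the buffer size and the sum_forward / sum_backward labels are mine, for illustration only, not from the question): the same packed-float reduction done once with ascending and once with descending addresses. Both are plain sequential streams of whole cache lines, which is exactly the pattern the HW prefetchers are built to lock onto.

BUF_FLOATS equ 1024*1024            ; 4 MiB of floats, larger than L2 (illustrative size)

section .bss
alignb 16
buf:    resd BUF_FLOATS

section .text
sum_forward:                        ; ascending addresses
    xorps   xmm0, xmm0              ; four packed partial sums accumulate in xmm0
    xor     ecx, ecx
.loop:
    addps   xmm0, [buf + rcx*4]     ; 16 bytes per iteration, addresses increasing
    add     ecx, 4
    cmp     ecx, BUF_FLOATS
    jb      .loop
    ret

sum_backward:                       ; same reduction, descending addresses
    xorps   xmm0, xmm0
    mov     ecx, BUF_FLOATS
.loop:
    sub     ecx, 4
    addps   xmm0, [buf + rcx*4]     ; 16 bytes per iteration, addresses decreasing
    jnz     .loop                   ; flags from SUB; ADDPS doesn't touch EFLAGS
    ret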

See the tag wiki for more links, especially Agner Fog's Optimizing Assembly guide to learn how to write asm that isn't slower than what a compiler could make. The tag wiki also has links to Intel's manuals.


Also, that is some ugly / bad asm. Here's how to do it better:

Printf only accepts double because of C rules for arg promotion to variadic functions. Yes, this is kinda dumb, but FP->base-10-text conversion dwarfs the overhead from an extra float->double conversion. If you need high-performance FP->string, you probably should avoid using a function that has to parse a format string every call.
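
For reference, a minimal standalone sketch of what that promotion looks like when calling printf under the x86-64 SysV convention (the print_one_float / fmt / val labels and the 3.5 constant are mine, for illustration): the 32-bit float is widened to a double in xmm0, and AL carries the count of vector registers used by the variadic call.

default rel
extern printf

section .rodata
fmt:    db  "%f", 10, 0
val:    dd  3.5                     ; 32-bit float

section .text
print_one_float:
    cvtss2sd xmm0, [val]            ; float -> double, converted straight from memory
    lea      rdi, [fmt]             ; 1st integer arg: format string
    mov      eax, 1                 ; al = 1 vector register (xmm0) used for variadic args
    sub      rsp, 8                 ; re-align the stack to 16 bytes for the call
    call     printf
    add      rsp, 8
    ret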

Debug-prints in asm are usually more trouble than they're worth, compared to using a debugger.

Also:

  • This is 64-bit code, so don't use the 32-bit int 0x80 ABI to exit.
  • The UNPCKLPS instructions are pointless, because you only care about the low element anyway. CVTPS2PD produces two results, but you're converting the same number twice in parallel instead of converting two and then unpacking. Only the low double in an XMM register matters when calling a function that takes scalar args, so you can leave high garbage in the rest of the register.
  • The store/reload is also pointless; just convert directly from the register holding the multiply result, as the version below does.

DEFAULT REL            ; use RIP-relative addressing for [vec1]

extern printf
;extern fflush         ; just call exit(3) instead of manual fflush
extern exit

section .rodata        ; read-only data can be part of the text segment
outputText:     db '[%f, %f, %f, %f]',10,0

align 16
vec1:    dd 1.0, 2.0, 3.0, 4.0
vec2:    dd 10.0,10.0,10.0,50.0

section .bss
;; static scratch space is unwise.  Use the stack to reduce cache misses, and for thread safety
; result:  resd 4       ; four 32-bit single-precision floats

section .text
global _start
_start:
    ;; sub rsp,16            ; What was this for?  We have a red-zone in x86-64 SysV, and we don't use the stack anyway

    movaps    xmm2, [vec1]
    ; fold the load into the mulps
    mulps     xmm2, [vec2]   ; (vec1 * vec2)

    ; printf only accepts 64-bit doubles, because it's a C variadic function.
    ; so convert these 32-bit floats packed in the 128-bit xmm2
    ; register into four 64-bit doubles, each in a separate xmm reg

    ; xmm2 = [f0,f1,f2,f3]
    cvtps2pd  xmm0, xmm2     ; xmm0=[d0,d1]
    movaps    xmm1, xmm0
    unpckhpd  xmm1, xmm1     ; xmm1=[d1,d1]

    unpckhpd  xmm2, xmm2     ; xmm2=[f2,f3, f2,f3]

    cvtps2pd  xmm2, xmm2     ; xmm2=[d2,d3]
    movaps    xmm3, xmm2
    unpckhpd  xmm3, xmm3     ; xmm3=[d3,d3]

    mov       edi, outputText     ; static data is in the low 2G, so we can use 32-bit absolute addresses
    ;lea      rdi, [outputText]   ; or this is the PIC way to do it

    mov       eax,4               ; al = 4: number of vector regs used for variadic args
    call      printf

    xor       edi, edi
    ;call      fflush              ; flush before _exit()
    jmp       exit                 ; tailcall exit(3) which does flush, like if you returned from main()

    ; add rsp,16

;; this is how you would exit if you didn't use the libc function.
_exit:  
    xor       edi, edi
    mov       eax, 231             ;  exit_group(0)
    syscall                        ; 64-bit code should use the 64-bit ABI

You could also use MOVHLPS to move the high 64 bits from one register into the low 64 bits of another reg, but that has a false dependency on the old contents of the destination register.

    cvtps2pd  xmm0, xmm2     ; xmm0=[d0,d1]

    ;movaps    xmm1, xmm0
    ;unpckhpd  xmm1, xmm1     ; xmm1=[d1,d1]

    ;xorps     xmm1, xmm1    ; break the false dependency
    movhlps   xmm1, xmm0     ; xmm1=[d1,??]  ; false dependency on old value of xmm1

On Sandybridge, XORPS + MOVHLPS would be more efficient, because Sandybridge handles xor-zeroing at register-rename time without needing an execution unit. IvyBridge and later, and AMD CPUs, can eliminate the MOVAPS at rename the same way, with zero latency, but the eliminated move still takes a uop and some front-end throughput resources.


If you were going to store and reload, and convert each float separately, you'd use CVTSS2SD, either as a load (cvtss2sd xmm2, [result + 12]) or after movss.

Using a MOVSS load first would break the false dependency on the old contents of the full register, which CVTSS2SD has because Intel designed it badly: it merges into the old value instead of replacing it. The same applies to int->float and int->double conversions. Merging into another vector is a much rarer use case than plain scalar math, and when you do want it, a reg-reg MOVSS afterwards can do the merge.
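
As a sketch of those variants, reusing the question's result buffer (the offset is chosen just for illustration):

    ; (1) convert straight from memory: one instruction, but CVTSS2SD merges
    ;     into the old upper half of xmm1, so it carries a false dependency
    cvtss2sd xmm1, [result + 4]

    ; (2) MOVSS load first: a MOVSS load writes the whole register (zeroing the
    ;     upper elements), so the convert only depends on the load
    movss    xmm1, [result + 4]
    cvtss2sd xmm1, xmm1

    ; (3) explicit dependency-breaking zero before the memory-source convert
    xorps    xmm1, xmm1
    cvtss2sd xmm1, [result + 4]

Compilers often emit a zeroing idiom like (3) ahead of memory-source conversions for exactly this reason.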

Peter Cordes