Once again I'm jumping back into teaching myself basic assembly language so I don't completely forget everything.
I wrote this practice code the other day, and in it, it turned out that I had to plop the results of a vector operation back into an array in reverse order; otherwise it gave a wrong answer. Incidentally, GCC outputs similar assembly when it moves SIMD results back to a memory location, so I presume it's the "correct" way.
However, something occurred to me, something I've been conscious of as a game developer for quite a long time: cache friendliness. My understanding is that moving forward through a contiguous block of memory is always ideal; otherwise you risk cache misses.
My question is: even though the example below does nothing more than multiply a couple of four-element vectors and spit out four numbers before quitting, does putting numbers back into an array in what is technically reverse order have any impact at all on cache misses in the real world? I'm thinking of a typical production-level program that does hundreds of thousands of SIMD vector calculations per second (and, more specifically, writes the results back to memory).
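To make the access pattern concrete, here is the difference I have in mind. This is only a minimal sketch, it reuses the result buffer from the program below, and the register choices are beside the point; it's the memory access order I'm asking about:

; forward order: result[0] up to result[3]
movss xmm0,[result]
movss xmm1,[result+4]
movss xmm2,[result+8]
movss xmm3,[result+12]

; reverse order: result[3] down to result[0] (the order my program below uses)
movss xmm3,[result+12]
movss xmm2,[result+8]
movss xmm1,[result+4]
movss xmm0,[result]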
Here is the full code (Linux x86-64 NASM), with comments, including the original comment that prompted me to bring this curiosity of mine to Stack Exchange:
extern printf
extern fflush
global _start
section .data
outputText: db '[%f, %f, %f, %f]',10,0
align 16
vec1: dd 1.0, 2.0, 3.0, 4.0
vec2: dd 10.0,10.0,10.0,50.0
section .bss
result: resd 4 ; four 32-bit single-precision floats
section .text
_start:
sub rsp,16 ; reserve 16 bytes of scratch stack (rsp stays 16-byte aligned for the call)
movaps xmm0,[vec1]
movaps xmm1,[vec2]
mulps xmm0,xmm1 ; xmm0 = (vec1 * vec2)
movaps [result],xmm0 ; copy 4 floats back to result[]
; printf only takes 64-bit doubles for %f (C promotes float varargs to double),
; so convert these 32-bit floats packed within the 128-bit xmm0
; register into four 64-bit floats, each in a separate xmm* reg
movss xmm0,[result+12] ; result[3]
unpcklps xmm0,xmm0 ; duplicate the low single (the 32->64 conversion is the cvtps2pd below)
cvtps2pd xmm3,xmm0 ; put double in 4th XMM
movss xmm0,[result+8] ; result[2]
unpcklps xmm0,xmm0 ; duplicate the low single
cvtps2pd xmm2,xmm0 ; put double in 3rd XMM
movss xmm0,[result+4] ; result[1]
unpcklps xmm0,xmm0 ; duplicate the low single
cvtps2pd xmm1,xmm0 ; put double in 2nd XMM
movss xmm0,[result] ; result[0]
unpcklps xmm0,xmm0 ; duplicate the low single
cvtps2pd xmm0,xmm0 ; put double in 1st XMM
; FOOD FOR THOUGHT!
; *****************
; That was done backwards, going from highest element
; of what is technically an array down to the lowest.
;
; This is because when it was done from lowest to
; highest, this garbled bird poop was the answer:
; [13510801139695616.000000, 20.000000, 30.000000, 200.000000]
;
; HOWEVER, if the correct way is this way, in which
; it traipses through an array backwards...
; is that not cache-unfriendly? Or is it too tiny and
; minuscule to have any impact on cache misses?
mov rdi, outputText ; pointer to the format string for printf
mov rax,4 ; al = 4: number of XMM regs carrying args (variadic call)
call printf
mov rdi,0 ; NULL: flush all open output streams
call fflush ; ensure we see printf output before exit
add rsp,16 ; release the scratch stack space
_exit:
mov eax,1 ; sys_exit in the 32-bit int 0x80 ABI
mov ebx,0 ; exit with return code 0 (no error)
int 80h ; (native 64-bit code would normally use mov eax,60 / syscall)
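For scale, this is roughly the kind of hot loop I have in mind when I talk about hundreds of thousands of SIMD calculations per second. It is purely hypothetical (the labels srcA, srcB, dst, the routine name hot_loop, and the buffer size are made up for illustration), and it stores its results forward, 16 bytes at a time:

BUF_BYTES equ 1048576 ; hypothetical buffer size: 1 MiB per array

section .bss
align 16
srcA: resb BUF_BYTES
srcB: resb BUF_BYTES
dst: resb BUF_BYTES

section .text
hot_loop: ; hypothetical routine, not part of the program above
    xor rcx,rcx
.next:
    movaps xmm0,[srcA+rcx] ; load 4 packed singles
    mulps xmm0,[srcB+rcx] ; multiply by the matching 4 singles
    movaps [dst+rcx],xmm0 ; store all 4 results, walking forward
    add rcx,16 ; next group of 4 floats (16 bytes)
    cmp rcx,BUF_BYTES
    jb .next
    ret

The point of the sketch is only the access pattern: each iteration touches the next 16 contiguous bytes, which is the forward, cache-friendly traversal I'm comparing against the element-by-element reverse reads in my program above.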