Unless some other code needs it in memory, it's cheaper to generate on the fly a vector with all 128 bits set to 1 = 0xFF... repeating = 2^128-1:
pcmpeqw xmm0, xmm0 ; xmm0 = 0xFF... repeating
;You can store to memory if you want, e.g. to set a bitmap to all-ones.
movups [rdx], xmm0
See also What are the best instruction sequences to generate vector constants on the fly?
For the use-case you described in comments, there's no reason to mess with static data in .data
or .rodata
, or static storage in .bss
. Just make space on the stack and pass pointers to that.
call_something_by_ref:
sub rsp, 24
pcmpeqw xmm0, xmm0 ; xmm0 = 0xFF... repeating
mov rdi, rsp
movaps [rdi], xmm0 ; one byte shorter than movaps [rsp], xmm0
lea rsi, [rdi+8]
call some_function
add rsp, 24
ret
Notice that this code has no immediate constants larger than 8 bits (for data or addresses), and it only touches memory that's already hot in cache (the bottom of the stack). And yes, store-forwarding does work from wide vector stores to integer loads when some_function
dereferences RDI and RSI separately.