I am wondering if there is any sequence of instructions without use of any other register to copy the lower 32 bits of RAX to its higher 32 bits. Of course, I want EAX intact as well.
Preferably without using any stack memory, either.
I am wondering if there is any sequence of instructions without use of any other register to copy the lower 32 bits of RAX to its higher 32 bits. Of course, I want EAX intact as well.
Preferably without using any stack memory, either.
My try... got headache from the music compo going on here at the demo party (or more likely from the travel here), so I gave up on the imul rax,rax,imm32
used to copy 31 bits and trying to save the 1 bit in sign and then patch the intermediate result, as it turned out to have several problems I didn't foresee.
So I instead opted to chicken-out, and copy only 2x 16 bit words, and then reshuffle in a way how Jester did (I was on a verge of posting idea of xchg al,ah
just when he posted answer, I swear).
; rax = 00001234 (bytes, not hexa digits)
ror rax, 16 ; 34000012
imul rax, rax, 0x00010001 ; 34001212
shr rax, 16 ; 00340012
imul rax, rax, 0x00010001 ; 34341212
ror rax, 24 ; 21234341
xchg al, ah ; 21234314
ror rax, 8 ; 42123431
xchg al, ah ; 42123413
rol rax, 16 ; 12341342
xchg al, ah ; 12341324
ror rax, 8 ; 41234132
xchg al, ah ; 41234123
rol rax, 8 ; 12341234
A bit shorter (instruction count) variant (with a funny twist of using rol ...,8
instructions only since 5th one):
; eax = 00001234 (bytes, not hexa digits)
ror rax, 8 ; 40000123
imul rax, rax, 0x01000001 ; 40123123
rol rax, 16 ; 12312340
mov al, ah ; 12312344
rol rax, 8 ; 23123441
rol ax, 8 ; 23123414
rol rax, 8 ; 31234142
rol ax, 8 ; 31234124
rol rax, 8 ; 12341243
rol ax, 8 ; 12341234
There has to be a simpler way :-D
shl rax, 8 ; 00012340
mov al, ah ; 00012344
bswap rax ; 44321000
shr rax, 16 ; 00443210
mov al, ah ; 00443211
ror rax, 8 ; 10044321
xchg ah, al ; 10044312
rol rax, 8 ; 00443121
xchg ah, al ; 00443112
shl rax, 8 ; 04431120
mov al, ah ; 04431122
ror rax, 32 ; 11220443
xchg ah, al ; 11220434
ror rax, 8 ; 41122043
xchg ah, al ; 41122034
ror rax, 8 ; 44112203
mov ah, al ; 44112233
ror rax, 8 ; 34411223
xchg ah, al ; 34411232
rol rax, 16 ; 41123234
xchg ah, al ; 41123243
ror rax, 8 ; 34112324
xchg ah, al ; 34112342
rol rax, 24 ; 12342341
xchg ah, al ; 12342314
ror rax, 8 ; 41234231
xchg ah, al ; 41234213
ror rax, 8 ; 34123421
xchg ah, al ; 34123412
ror rax, 16 ; 12341234
It's hard to do this efficiently without touching a 2nd register, as requested. The only suggestion in comments that won't suck for performance is using a 64-bit constant for imul
:
; mov eax,eax ; if eax isn't already zero-extended into rax
imul rax, [rel broadcast_low32_multiplier]
section .rodata
broadcast_low32_multiplier: dq 0x100000001
This has 1-per-clock throughput (and 3 cycle latency) on Intel Sandybridge-family CPUs, and on AMD Ryzen. But only one per 4 clocks on Bulldozer-family. (http://agner.org/optimize/, https://uops.info/)
You can't do this with imul r64, r64, imm32
, because our constant has 33 bits. You can factor 0x100000001 into 0x663d81 * 0x281
, and do it in 2 uops, and 6 cycle latency on CPUs with fast multipliers. (Thanks, @stepan)
;mov ecx, eax ; zero extend into a different reg with 0 latency, and imul from there, if not already zero-extended
imul rax, rax, 0x281
imul rax, rax, 0x663d81 ; 0x281 * 0x663d81 = 0x100000001
The other option to make a fully self-contained fragment of x86-64 machine code, without reference to an external constant (e.g. for shellcode, or because you don't expect the constant to be hot in cache) is to put the constant inline with the instructions and jump over it.
; mov eax,eax ; if eax isn't already zero-extended into rax
imul rax, [rel broadcast_low32_multiplier]
jmp after_constant
broadcast_low32_multiplier: dq 0x100000001
after_constant:
From a correctness POV, there's no difference between this and instructions using immediates. NASM listing:
1 00000000 480FAF0502000000 imul rax, [rel broadcast_low32_multiplier]
2 00000008 EB08 jmp after_constant
3 0000000A 0100000001000000 broadcast_low32_multiplier: dq 0x100000001
4 after_constant:
It does do a data load from the same page as the code, but AFAIK it's impossible to make a page executable but not readable with page table settings, so there's no way for this to fail where pure instructions would work. It will use up a cache line in L1D cache, and a dTLB entry, as well as L1i cache and an iTLB entry, though. It may even miss in dTLB if the entry is only hot in iTLB, and will probably miss in L1d cache (but probably hit in L2 cache; most CPUs have some non-exclusive unified cache that instruction-fetch goes through. See this more about performance lack-of-benefits for mixing code and data.)
Jester's suggestion in comments of 2 stores / 1 reload is compact, but will cause a store-forwarding stall on all CPUs except in-order Atom (pre Silvermont):
push rax
mov [rsp+4], eax ; overwrite high bytes
pop rax ; store-forwarding stall when a wide load covers 2 narrow stores
This may be the smallest total size version, especially if you have a stack frame that lets you use a rbp+disp8
addressing mode like [rbp-12]
, saving 1 byte vs. rsp
-relative (which needs a SIB even with no index register).
(The "other" standard way would be mov rcx, 0x100000001
/ imul rax, rcx
, but a 64-bit immediate is large, and shift/OR is actually lower latency.)
It's likely going to be worth it to avoid needing to do exactly what you asked, if performance matters.
E.g. save/restore one of the other 15 general-purpose registers around whatever you're doing, so you can use it as a scratch register. Ideally around the whole function, so you can do this broadcast inside a loop using a scratch reg. Or just clobber without saving first, if it's something you can just reload or recalc quickly.
; rax = garbage:eax
push rbx ; save/restore if needed, preferably at the top of the function
mov ebx, eax ; clear high garbage with 32-bit zero extend
shl rax, 32 ; clear low garbage via shifting
or rax, rbx ; or lea rcx, [rax+rbx] into a 3rd reg if you want
...
pop rbx ; preferably at the bottom of the function
This has 2 cycle latency (for rax), even without mov-elimination. We shift the original RAX in parallel with copying (with zero-extension) into RBX. If EAX was already zero-extended, you could shift the copy in RBX, if you were optimizing for a CPU with mov-elimination: Intel's implementation frees up mov-elimination tracking resources when you overwrite the mov destination. But that would make it higher latency for CPUs without mov-elimination, putting the mov on the critical path. (A microcode update for Ice Lake disables mov-elimination. /sigh.)
The push/pop introduces a store/reload into the critical path for RBX, so choose your scratch reg wisely if you actually have to do this around every broadcast, instead of just freeing up an extra register for a whole loop or function.
Or with BMI2, rorx rcx, rax, 32
/ or rax, rcx
gets the job done in only 2 uops. (The upper 32 bit of RAX must already be zeroed for this, unlike with the mov/shift).
Or use SSE2 (baseline for x86-64). If you can use xmm0 instead of rax in the first place:
; movd xmm0, eax ; instead of whatever was setting rax
punpckldq xmm0, xmm0 ; [dcba] -> [bbaa]
; movq rax, xmm0
1 cycle latency on most CPUs (except SlowShuffle Core2 / K8, where you could use pshuflw
with the right imm8 to make this efficient), plus whatever extra cost for using xmm0 instead of rax.
Copying RAX to/from XMM0 is most of the cost here if you do it that way, like 4 to 6 cycle round trip depending on microarchitecture, on top of the shuffle. (e.g. uops.info's Zen2 latency tests include a movd/movq round trip measured at 6 cycles). That would clobber xmm0, which might or might not be ok.
But you could avoid that cost if you were using xmm0 the whole time for your integer value. You can do scalar integer math in an xmm register (ignoring what happens to upper bytes). Instructions like paddd
/ paddq
, pslldq
, pmuludq
/ pand
are available. The main thing you can't do is use it in an addressing mode, but if you're broadcasting the low 32 bits this probably isn't an address or an index.