
I am wondering if there is any sequence of instructions, without using any other register, to copy the lower 32 bits of RAX into its upper 32 bits. Of course, I want EAX to stay intact as well.

Preferably without using any stack memory, either.

Peter Cordes
masec
    You can use the stack. – Jester Feb 02 '18 at 17:14
  • Could you please elaborate? – masec Feb 02 '18 at 17:16
    Or `imul` against memory constant? – Ped7g Feb 02 '18 at 17:17
  • And any idea without using stack as well? – masec Feb 02 '18 at 17:17
    `push rax; mov [rsp+4], eax; pop rax` for example. – Jester Feb 02 '18 at 17:19
  • The stack method would be to put two copies of the 32-bit register (the lower bits) onto the stack and then pop them as a single value into the 64-bit register. – Michael Petch Feb 02 '18 at 17:19
  • you can use `lea rax,[rax + rax*8]` to copy 3 bits to the next 3 bits, and you can make 3-bit empty spaces by rotations and shifts, but I'm not sure how much work it would be to reshuffle it back into two 32:32 copies. In case you don't catch my drift: it would require MANY instructions, so this is not a real proposal, just a guess that it is *possible*, but not practical. – Ped7g Feb 02 '18 at 17:20
  • Actually there's `imul rax, rax, imm32` option, which can be used to copy 31 bits, and one `sar` to copy top bit... hmm... ok, I think I have a solution in 5-10 instructions only. (going to verify + post answer) – Ped7g Feb 02 '18 at 17:24
  • @Ped7g `imul` only supports 32bit constants. – sudhackar Feb 02 '18 at 17:56
    @sudhackar You can use a memory operand to pass an arbitrary 64 bit constant. – fuz Feb 02 '18 at 18:08
    @sudhackar yes, `imm32` in Intel docs stands for "32 bit immediate". And it's even worse, it's SIGNED immediate, so it's more like only 31 bits. ... OP&all waiting: nope, so far no luck, those two bottom+top bits are tough cookies, I think 31 bits copied next to 31 bits are possible quite reasonably, but 32+32 so far refuse to cooperate. – Ped7g Feb 02 '18 at 18:27
    Had you been able to use the stack but not been able to directly use _RSP_ in an instruction, you could have done `rol eax, 16; push ax; rol eax, 16; push ax; rol eax, 16; push ax; rol eax, 16; push ax; pop rax` – Michael Petch Feb 02 '18 at 18:31

3 Answers


My try... I got a headache from the music compo going on here at the demo party (or more likely from the travel), so I gave up on the `imul rax,rax,imm32` approach of copying 31 bits, saving the remaining bit in the sign, and then patching the intermediate result: it turned out to have several problems I didn't foresee.

So I instead opted to chicken out and copy only two 16-bit words, then reshuffle the way Jester did (I was on the verge of posting the `xchg al, ah` idea just when he posted his answer, I swear).

    ; rax =                           00001234 (bytes, not hex digits)
    ror     rax, 16                 ; 34000012
    imul    rax, rax, 0x00010001    ; 34001212
    shr     rax, 16                 ; 00340012
    imul    rax, rax, 0x00010001    ; 34341212
    ror     rax, 24                 ; 21234341
    xchg    al, ah                  ; 21234314
    ror     rax, 8                  ; 42123431
    xchg    al, ah                  ; 42123413
    rol     rax, 16                 ; 12341342
    xchg    al, ah                  ; 12341324
    ror     rax, 8                  ; 41234132
    xchg    al, ah                  ; 41234123
    rol     rax, 8                  ; 12341234

A variant that is a bit shorter (in instruction count), with the funny twist that from the fifth instruction on it uses only `rol ..., 8` instructions:

    ; rax =                           00001234 (bytes, not hex digits)
    ror     rax, 8                  ; 40000123
    imul    rax, rax, 0x01000001    ; 40123123
    rol     rax, 16                 ; 12312340
    mov     al, ah                  ; 12312344
    rol     rax, 8                  ; 23123441
    rol     ax, 8                   ; 23123414
    rol     rax, 8                  ; 31234142
    rol     ax, 8                   ; 31234124
    rol     rax, 8                  ; 12341243
    rol     ax, 8                   ; 12341234
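Both sequences are mechanical enough to replay outside of asm. Below is my own Python transcription (the helper and function names such as `first_variant` are mine, not from the answer), using 64-bit wrap-around arithmetic. Note that `rol ax, 8` and `xchg al, ah` do the same thing: swap the low two bytes.

```python
MASK64 = (1 << 64) - 1

def rol(v, n):
    """Rotate the 64-bit value v left by n bits."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK64

def ror(v, n):
    return rol(v, 64 - n)

def swap_al_ah(v):
    """xchg al, ah -- also what rol ax, 8 does: swap the low two bytes."""
    return (v & ~0xFFFF) | ((v >> 8) & 0xFF) | ((v & 0xFF) << 8)

def mov_al_ah(v):
    """mov al, ah: copy byte 1 into byte 0."""
    return (v & ~0xFF) | ((v >> 8) & 0xFF)

def first_variant(rax):
    """Replay of the 13-instruction sequence (rax zero-extended)."""
    rax = ror(rax, 16)
    rax = (rax * 0x00010001) & MASK64   # imul rax, rax, 0x00010001
    rax >>= 16                          # shr rax, 16
    rax = (rax * 0x00010001) & MASK64
    rax = ror(rax, 24)
    rax = swap_al_ah(rax)               # xchg al, ah
    rax = ror(rax, 8)
    rax = swap_al_ah(rax)
    rax = rol(rax, 16)
    rax = swap_al_ah(rax)
    rax = ror(rax, 8)
    rax = swap_al_ah(rax)
    return rol(rax, 8)

def second_variant(rax):
    """Replay of the shorter 10-instruction sequence."""
    rax = ror(rax, 8)
    rax = (rax * 0x01000001) & MASK64   # imul rax, rax, 0x01000001
    rax = rol(rax, 16)
    rax = mov_al_ah(rax)
    for _ in range(3):                  # rol rax, 8 / rol ax, 8, three times
        rax = rol(rax, 8)
        rax = swap_al_ah(rax)           # rol ax, 8
    return rax
```

The byte positions never overlap when the `imul` adds the shifted copy, so no carries occur and both functions broadcast any 32-bit value.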
Ped7g
    `ror ax,8` will suck less for performance on CPUs that rename `ah` separately but not AL / AX (i.e. [Haswell/Skylake](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to)). Even on Sandybridge, merging a dirty AX is cheaper than AH (which needs a separate cycle in the front-end to issue / rename the merge uop). Although if you care about code-size or performance, you wouldn't be doing this at all. (You'd use a memory constant for imul, or SSE2, or a scratch reg). – Peter Cordes Feb 02 '18 at 20:50
  • @PeterCordes thank you for the performance insight; I can't see what this may be good for except as a proof of possibility. For size, even the 64-bit `imul` memory constant is better, and two registers even more so, I think, so I'm going to stop thinking about it now. It was just enough fun. :) – Ped7g Feb 02 '18 at 20:52
  • I'm pretty sure that imul clobbered rdx. – Joshua Mar 15 '21 at 01:02
  • 1
    @Joshua: Only 1-operand mul/imul write RDX. https://www.felixcloutier.com/x86/imul. The multi-operand forms don't have any implicit operands, and are faster because they don't need an extra uop to write a 2nd output register. – Peter Cordes Mar 15 '21 at 01:07
  • 1
    @Ped7g: [stepan suggested in comments](https://stackoverflow.com/questions/48587643/copying-eax-to-rax-higher-bits/48589368#comment117780633_48589368) that `0x100000001` factors into two values that fit in imm32, allowing `mov eax,eax` / `imul rax, 0x663d81` / `imul rax, 0x281`. This is by far the best way given the constraints (and without resorting to any "weird stuff" like memory constants, or xmm regs, or save/restore of another reg.) – Peter Cordes Mar 15 '21 at 03:48

There has to be a simpler way :-D

shl rax, 8   ; 00012340
mov al, ah   ; 00012344
bswap rax    ; 44321000
shr rax, 16  ; 00443210
mov al, ah   ; 00443211
ror rax, 8   ; 10044321
xchg ah, al  ; 10044312
rol rax, 8   ; 00443121
xchg ah, al  ; 00443112
shl rax, 8   ; 04431120
mov al, ah   ; 04431122
ror rax, 32  ; 11220443
xchg ah, al  ; 11220434
ror rax, 8   ; 41122043
xchg ah, al  ; 41122034
ror rax, 8   ; 44112203
mov ah, al   ; 44112233
ror rax, 8   ; 34411223
xchg ah, al  ; 34411232
rol rax, 16  ; 41123234
xchg ah, al  ; 41123243
ror rax, 8   ; 34112324
xchg ah, al  ; 34112342
rol rax, 24  ; 12342341
xchg ah, al  ; 12342314
ror rax, 8   ; 41234231
xchg ah, al  ; 41234213
ror rax, 8   ; 34123421
xchg ah, al  ; 34123412
ror rax, 16  ; 12341234
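For anyone who wants to check the 30-step dance without single-stepping a debugger, here is my own Python replay of the sequence (helper names are mine, not from the answer):

```python
M64 = (1 << 64) - 1

def rol64(v, n):
    n %= 64
    return ((v << n) | (v >> (64 - n))) & M64

def ror64(v, n):
    return rol64(v, 64 - n)

def mov_al_ah(v):   # al := ah
    return (v & ~0xFF) | ((v >> 8) & 0xFF)

def mov_ah_al(v):   # ah := al
    return (v & ~0xFF00) | ((v & 0xFF) << 8)

def xchg_ah_al(v):  # swap the low two bytes
    return (v & ~0xFFFF) | ((v >> 8) & 0xFF) | ((v & 0xFF) << 8)

def bswap64(v):
    return int.from_bytes(v.to_bytes(8, 'little'), 'big')

def jester_broadcast(rax):
    """Replay of the 30-instruction sequence; rax must be zero-extended."""
    rax = (rax << 8) & M64          # shl rax, 8
    rax = mov_al_ah(rax)
    rax = bswap64(rax)
    rax >>= 16                      # shr rax, 16
    rax = mov_al_ah(rax)
    rax = ror64(rax, 8)
    rax = xchg_ah_al(rax)
    rax = rol64(rax, 8)
    rax = xchg_ah_al(rax)
    rax = (rax << 8) & M64          # shl rax, 8
    rax = mov_al_ah(rax)
    rax = ror64(rax, 32)
    rax = xchg_ah_al(rax)
    rax = ror64(rax, 8)
    rax = xchg_ah_al(rax)
    rax = ror64(rax, 8)
    rax = mov_ah_al(rax)
    rax = ror64(rax, 8)
    rax = xchg_ah_al(rax)
    rax = rol64(rax, 16)
    rax = xchg_ah_al(rax)
    rax = ror64(rax, 8)
    rax = xchg_ah_al(rax)
    rax = rol64(rax, 24)
    rax = xchg_ah_al(rax)
    rax = ror64(rax, 8)
    rax = xchg_ah_al(rax)
    rax = ror64(rax, 8)
    rax = xchg_ah_al(rax)
    return ror64(rax, 16)
```

Since the sequence only moves bytes around (the shifts discard bytes that are provably zero when the input is zero-extended), it works for every 32-bit value.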
Jester
  • I like this. No stack, no memory, simply done with registers. I suspect this is much closer to the solution the OP's professor is looking for. – Michael Petch Feb 02 '18 at 19:35

It's hard to do this efficiently without touching a 2nd register, as requested. The only suggestion in comments that won't suck for performance is using a 64-bit constant for imul:

    ; mov   eax, eax                ; if eax isn't already zero-extended into rax
    imul    rax, [rel broadcast_low32_multiplier]

    section .rodata
    broadcast_low32_multiplier:   dq  0x100000001

This has 1-per-clock throughput (and 3 cycle latency) on Intel Sandybridge-family CPUs, and on AMD Ryzen. But only one per 4 clocks on Bulldozer-family. (http://agner.org/optimize/, https://uops.info/)
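The reason the multiply works: for a zero-extended x, x * 0x100000001 = x + (x << 32), and the two copies land in disjoint halves of the 64-bit product, so nothing carries. A quick Python model (my illustration, not code from the answer):

```python
MASK64 = (1 << 64) - 1

def broadcast_imul(x):
    """Model of imul rax, [broadcast_low32_multiplier] for a
    zero-extended 32-bit x."""
    assert x == x & 0xFFFFFFFF, "upper 32 bits must already be zero"
    # x * 0x100000001 = x + (x << 32); the halves don't overlap.
    return (x * 0x100000001) & MASK64
```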

You can't do this with imul r64, r64, imm32, because our constant has 33 bits. You can factor 0x100000001 into 0x663d81 * 0x281, and do it in 2 uops, and 6 cycle latency on CPUs with fast multipliers. (Thanks, @stepan)

    ; mov   ecx, eax                ; zero-extend into a different reg with 0 latency, and imul from there, if not already zero-extended
    imul    rax, rax, 0x281
    imul    rax, rax, 0x663d81      ; 0x281 * 0x663d81 = 0x100000001
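Incidentally, 0x100000001 = 2^32 + 1 is the Fermat number F5, whose famous factorization is 641 × 6700417 = 0x281 × 0x663d81. A Python check of the two-imul version (my sketch, function name is mine):

```python
MASK64 = (1 << 64) - 1

# 0x100000001 doesn't fit in a signed imm32, but both factors do.
assert 0x281 * 0x663d81 == 0x100000001

def broadcast_two_imuls(x):
    """Model of the two imul-by-immediate instructions for a
    zero-extended 32-bit x; multiplication mod 2**64 is associative,
    so this equals x * 0x100000001 mod 2**64."""
    x = (x * 0x281) & MASK64
    x = (x * 0x663d81) & MASK64
    return x
```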

The other option to make a fully self-contained fragment of x86-64 machine code, without reference to an external constant (e.g. for shellcode, or because you don't expect the constant to be hot in cache) is to put the constant inline with the instructions and jump over it.

    ; mov   eax, eax                ; if eax isn't already zero-extended into rax
    imul    rax, [rel broadcast_low32_multiplier]
    jmp     after_constant
    broadcast_low32_multiplier:   dq  0x100000001
    after_constant:

From a correctness POV, there's no difference between this and instructions using immediates. NASM listing:

 1 00000000 480FAF0502000000          imul    rax, [rel broadcast_low32_multiplier]
 2 00000008 EB08                      jmp       after_constant
 3 0000000A 0100000001000000          broadcast_low32_multiplier:   dq  0x100000001
 4                                    after_constant:

It does do a data load from the same page as the code, but AFAIK it's impossible to make a page executable but not readable with page-table settings, so there's no way for this to fail where pure instructions would work. It will use up a cache line in L1d cache and a dTLB entry, as well as L1i cache and an iTLB entry, though. It may even miss in dTLB if the entry is only hot in iTLB, and will probably miss in L1d cache (but probably hit in L2 cache; most CPUs have some non-exclusive unified cache level that instruction fetch goes through). See this for more about the performance non-benefits of mixing code and data.


Store / reload: compact but high latency

Jester's suggestion in comments of 2 stores / 1 reload is compact, but will cause a store-forwarding stall on all CPUs except in-order Atom (pre Silvermont):

    push rax
    mov  [rsp+4], eax     ; overwrite high bytes
    pop  rax              ; store-forwarding stall when a wide load covers 2 narrow stores

This may be the smallest total-size version, especially if you have a stack frame that lets you use an rbp+disp8 addressing mode like [rbp-12], saving 1 byte vs. rsp-relative (which needs a SIB byte even with no index register).
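Modeling the stores on a little-endian byte buffer shows why this works regardless of the garbage in the upper half (my illustration; the function name is mine):

```python
def broadcast_store_reload(rax):
    """Model of push rax / mov [rsp+4], eax / pop rax on a
    little-endian stack slot."""
    stack = bytearray(rax.to_bytes(8, 'little'))   # push rax
    eax = rax & 0xFFFFFFFF
    stack[4:8] = eax.to_bytes(4, 'little')         # mov [rsp+4], eax
    return int.from_bytes(stack, 'little')         # pop rax
```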


The standard way is faster, using a 2nd register

(The "other" standard way would be mov rcx, 0x100000001 / imul rax, rcx, but a 64-bit immediate is large, and shift/OR is actually lower latency.)

It's likely going to be worth it to avoid needing to do exactly what you asked, if performance matters.

E.g. save/restore one of the other 15 general-purpose registers around whatever you're doing, so you can use it as a scratch register. Ideally around the whole function, so you can do this broadcast inside a loop using a scratch reg. Or just clobber without saving first, if it's something you can just reload or recalc quickly.

    ; rax = garbage:eax
    push   rbx            ; save/restore if needed, preferably at the top of the function

    mov    ebx, eax       ; clear high garbage with 32-bit zero extend
    shl    rax, 32        ; clear low garbage via shifting
    or     rax, rbx       ; or  lea rcx, [rax+rbx]  into a 3rd reg if you want

    ...
    pop    rbx            ; preferably at the bottom of the function

This has 2 cycle latency (for rax), even without mov-elimination. We shift the original RAX in parallel with copying (with zero-extension) into RBX. If EAX was already zero-extended, you could shift the copy in RBX, if you were optimizing for a CPU with mov-elimination: Intel's implementation frees up mov-elimination tracking resources when you overwrite the mov destination. But that would make it higher latency for CPUs without mov-elimination, putting the mov on the critical path. (A microcode update for Ice Lake disables mov-elimination. /sigh.)

The push/pop introduces a store/reload into the critical path for RBX, so choose your scratch reg wisely if you actually have to do this around every broadcast, instead of just freeing up an extra register for a whole loop or function.

Or with BMI2, rorx rcx, rax, 32 / or rax, rcx gets the job done in only 2 uops. (The upper 32 bits of RAX must already be zeroed for this, unlike with the mov/shift.)
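Both two-register variants are easy to model in Python (function names are mine): the shl/or version tolerates garbage in the upper half, while the rorx/or version needs rax pre-zero-extended, matching the caveats above.

```python
MASK64 = (1 << 64) - 1

def broadcast_shl_or(rax):
    """Model of mov ebx, eax / shl rax, 32 / or rax, rbx."""
    ebx = rax & 0xFFFFFFFF        # mov ebx, eax (32-bit write zero-extends)
    rax = (rax << 32) & MASK64    # shl rax, 32: low garbage shifted out
    return rax | ebx              # or rax, rbx

def broadcast_rorx_or(rax):
    """Model of rorx rcx, rax, 32 / or rax, rcx; upper 32 bits of
    rax must already be zero."""
    assert rax >> 32 == 0
    rcx = ((rax >> 32) | (rax << 32)) & MASK64   # rorx: halves swapped
    return rax | rcx
```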


Or use SSE2 (baseline for x86-64). If you can use xmm0 instead of rax in the first place:

    ; movd       xmm0, eax         ; instead of whatever was setting rax
    punpckldq  xmm0, xmm0          ; [dcba] -> [bbaa]
    ; movq       rax, xmm0

1 cycle latency on most CPUs (except SlowShuffle Core2 / K8, where you could use pshuflw with the right imm8 to make this efficient), plus whatever extra cost for using xmm0 instead of rax.

Copying RAX to/from XMM0 is most of the cost here if you do it that way: around a 4 to 6 cycle round trip depending on microarchitecture, on top of the shuffle. (e.g. uops.info's Zen 2 latency tests include a movd/movq round trip measured at 6 cycles.) That would clobber xmm0, which might or might not be OK.

But you could avoid that cost if you were using xmm0 the whole time for your integer value. You can do scalar integer math in an xmm register (ignoring what happens to upper bytes). Instructions like paddd / paddq, pslldq, pmuludq / pand are available. The main thing you can't do is use it in an addressing mode, but if you're broadcasting the low 32 bits this probably isn't an address or an index.
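For reference, punpckldq interleaves the low dwords of its two operands; with the same register as both sources, the low qword of the result becomes two copies of the low dword. A Python model of the shuffle (my illustration, function name is mine):

```python
def punpckldq(dst, src):
    """Model of punpckldq dst, src on 128-bit values:
    result dwords (low to high) = [dst0, src0, dst1, src1]."""
    d0 = dst & 0xFFFFFFFF
    d1 = (dst >> 32) & 0xFFFFFFFF
    s0 = src & 0xFFFFFFFF
    s1 = (src >> 32) & 0xFFFFFFFF
    return d0 | (s0 << 32) | (d1 << 64) | (s1 << 96)
```

With dst == src this turns dwords [a, b, ...] into [a, a, b, b], so the low 64 bits are the broadcast the question asks for.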

Peter Cordes
  • Why does this deserve a downvote for pointing out the X-Y problem, and suggesting SSE2, and providing an `imul` solution that obeys all the stated restrictions? (And with this update, also the unstated restriction that I think separate data is to be avoided, so it's a self-contained machine code sequence.) – Peter Cordes Feb 03 '18 at 08:44
    No idea, I still find the `imul` with a memory constant highly likely superior to everything else mentioned here (except using two registers, of course) in production conditions. I just didn't mention it in my answer because you already did, so I think your answer is a vital addendum that brings this question to the point where even I would find it informative if I stumbled upon it by accident. – Ped7g Feb 03 '18 at 08:56
  • @Ped7g: Ok, cool, that's how I see it too. Yours and Jester's answers are sort of the "fun" way (so upvoted), mine is an attempt at a "real" way you might ever actually use. And BTW, `imul` with a memory constant beats even the copy+shift+or for throughput on Intel CPUs and Ryzen (1 per clock and only 1 uop), if there's surrounding code that doesn't bottleneck on port1. – Peter Cordes Feb 03 '18 at 09:03
    0x100000001 == 0x663d81 * 0x281 so we could try `imul rax, rax, 0x663d81` + `imul rax, rax, 0x281` which may be better for cold code (when data constant is not in L1) :) – stepan Mar 14 '21 at 18:05
  • @stepan: if you want to post that as an answer, I'd upvote it. (I did already edit it into my answer, but it's a good enough idea that it's still worth its own answer; you deserve some rep for it.) – Peter Cordes Mar 15 '21 at 03:49
  • @PeterCordes oh, thanks, I think your answer is more than comprehensive here! – stepan Mar 15 '21 at 06:18