Like Matteo shows, you can loop one bit at a time, reading at one position and writing bits at another position.
Matteo's answer reads at a variable position by shifting a mask, and writes at a position that moves in lock-step, starting at the bottom of the register (shifting another mask).
It's easier to read the MSB of the input, then left-shift the input (with add same,same) and repeat. So we read bits starting with the top bit, and construct the result starting with its MSB. (We shift 1 bit at a time into the destination: an ADD to left-shift it, and a conditional add to set the new bit position or not.)
We can use a 2's complement signed comparison to read the top bit of a register: if it's set, x < 0, otherwise it's not.
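Here's a minimal C model of that idea (a sketch of the logic only, not the y86 code itself; the function name is made up):

#include <stdint.h>

// Shift the top 32-n bits of src into dst one at a time,
// reading each bit from the sign (MSB) and using only adds to shift.
uint32_t shr_bit_at_a_time(uint32_t src, unsigned n)
{
    uint32_t dst = 0;
    for (unsigned i = 0; i < 32 - n; i++) {
        dst += dst;              // dst <<= 1
        if ((int32_t)src < 0)    // MSB of src set? (signed compare against 0)
            dst += 1;            // shift it into the bottom of dst
        src += src;              // src <<= 1
    }
    return dst;                  // == (original src) >> n, for 0 <= n <= 32
}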
x86 and y86 have a flag called SF that's set according to the MSB of the result (of an ALU operation). x86 has js / cmovs / sets instructions that check the SF condition directly. y86 only has jl / jge and other signed-compare conditions that check SF!=OF, so we need an extra compare against zero to clear OF (x - 0 can't overflow). Or, semantically, we actually compare against zero instead of just reading SF. (We can optimize compare-against-zero into andl %eax,%eax or andq %rax,%rax, which helps on a version of y86 that doesn't have sub-immediate. y86 is also missing x86's non-destructive test and cmp instructions, which are like and and sub but only write flags.)
Porting to y86-64 should be nearly trivial. (Change reg names, and 32 becomes 64).
Test case: 0x12345 >> 1 = 0x000091a2. (I don't see a way to permalink the code on that site, the way the Godbolt compiler explorer allows.)
# constant input test case
irmovl 0x12345, %eax
# irmovl 3, %ecx # this could trivially handle variable counts, but doesn't.
# start of right-shift block:
# input: EAX = number to be shifted
# output: EDX = number >> 1
# clobbers: EAX, ECX, EDI. (EDI=1 to work around lack of add-immediate)
xorl %edx, %edx # dst = 0, like irmovl $0, %edx
irmovl 1, %edi # y86 is missing immediate add?
# shift 32-n bits from EAX into the bottom of EDX
# one at a time using SF to read them from the MSB
irmovl 31, %ecx # 31 = 32 - 1: loop count for a hard-coded shift count of 1
# or calculate this as 32 - count with neg / add or equivalent
rshift: # do {
addl %edx, %edx # dst <<= 1
andl %eax, %eax # compare against zero because y86 is missing js / cmovs that tests just SF
jge MSB_zero # jge = jnl = not less (signed): skip if the MSB was clear
xorl %edi, %edx # edx ^= 1. y86 is missing OR? low bit = 0 so we can ADD or XOR to set it
MSB_zero:
addl %eax, %eax # src <<= 1
subl %edi, %ecx
jne rshift # }while(--ecx); # semantically jnz
halt # result in EDX
#shr $1, %eax # the single x86 instruction this block emulates
I used xor-zeroing because that y86 simulator assembles to variable-length machine code like x86, so irmovl 0, %edx would be a larger, less efficient instruction (6 bytes vs. 2 for xorl).
Or do the carry from the MSB of EAX to the LSB of EDX branchlessly with CMOVL:
# loop body:
addl %edx, %edx # dst <<= 1
xorl %esi, %esi # esi = 0
subl %esi, %eax # src -= 0, just to set flags (can't overflow, so OF is cleared)
cmovl %edi, %esi # esi = (src<0) ? 1 : 0 = MSB of EAX
addl %esi, %edx # copy the bit into EDX (can't carry to higher bits)
addl %eax, %eax # src <<= 1
If your y86 simulator models a performance penalty for branch mispredicts, use this version. Otherwise branching is fewer instructions.
Or if you care about performance, it should be possible to use a lookup table for a whole byte at a time, with fixups across byte boundaries.
But with no left-shift to efficiently assemble the separate bytes, you'd need a separate 256-entry LUT of qwords for each byte-position! Or load from an offset and then mask off the "garbage" bytes.
Oh, and you need a right shift to extract bytes from a qword to feed the array indexing. If y86 can do byte loads, then you could store the input integer to memory and reload it 1 byte at a time. Or again emulate byte loads with unaligned qword loads and AND with 0x00...0FF to isolate that byte at the bottom of a register.
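A speculative C sketch of that table-per-byte-position idea, for one fixed shift count (the names, the fixed SHIFT, and little-endian byte order are all assumptions; the tables are built with a C shift here only because in practice they'd be precomputed data):

#include <stdint.h>
#include <string.h>

#define SHIFT 1                    /* one fixed shift count per set of tables */
static uint64_t lut[8][256];       /* 8 byte positions x 256 byte values = 16 KiB */

static void init_lut(void)         /* stands in for a precomputed data section */
{
    for (int p = 0; p < 8; p++)
        for (int b = 0; b < 256; b++)
            lut[p][b] = ((uint64_t)b << (8 * p)) >> SHIFT;
}

uint64_t shr_lut(uint64_t x)
{
    uint8_t bytes[8];
    memcpy(bytes, &x, 8);          /* store/reload to get at the bytes without a shift */
    uint64_t r = 0;
    for (int p = 0; p < 8; p++)
        r += lut[p][bytes[p]];     /* each entry is a full qword already in position,
                                      so adjacent bytes' contributions never overlap */
    return r;
}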
In fact, we can use store/reload with byte offsets and masking to "efficiently" do a right shift by a multiple of 8 bits in only a few instructions.
Oh crap, but we have a chicken/egg problem for runtime-variable counts: we need count / 8 as a byte offset (there are 8 bits in a byte), and dividing by 8 is itself a right shift. But count is small, so we could use a repeated-subtraction loop for it. (You might want to AND with 0x3f or 0x1f (depending on operand-size) to wrap the count at 64 or 32, like x86 hardware shifts do. That avoids indexing memory way outside the intended range for over-large counts.)
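A C sketch of that store/reload trick for counts that are a multiple of 8 (hypothetical function name; assumes little-endian, and models the count/8 repeated-subtraction loop suggested above):

#include <stdint.h>
#include <string.h>

uint64_t shr_by_bytes(uint64_t x, unsigned count)
{
    count &= 0x3f;                 /* wrap the count like x86 hardware shifts do */
    unsigned k = 0;
    while (count >= 8) {           /* count / 8 by repeated subtraction (no divide, */
        count -= 8;                /* and dividing by 8 would itself be a shift)    */
        k++;
    }
    uint8_t buf[16] = {0};         /* 8 value bytes followed by 8 zero bytes */
    memcpy(buf, &x, 8);            /* "store" the input */
    uint64_t r;
    memcpy(&r, buf + k, 8);        /* unaligned reload k bytes higher */
    return r;                      /* == x >> (8*k) when count was a multiple of 8 */
}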
Anyway, then you could extend this to handle right-shift counts that are not a multiple of 8 by rounding up (shifting out too many bits), then putting the needed bits back in one at a time like the loop in the first part of this answer. (After an unaligned load to get those bits at the top of a register.)
Or maybe use Matteo's method of walking a mask, using a LUT for start points. But if we're already doing a store / unaligned reload for byte shifting, another reload is probably the way to go. We can calculate the right offset for it relative to the first unaligned reload: 4 or 8 bytes earlier, so the starting MSB is the bit just below the lowest bit of the first load.
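And a C sketch of that combined idea: round the byte offset up, then pull the over-shifted bits back in from a second reload 8 bytes lower, using the same MSB-read loop as the first part (hypothetical name; assumes little-endian and a 64-bit value):

#include <stdint.h>
#include <string.h>

uint64_t shr_any(uint64_t x, unsigned count)
{
    count &= 0x3f;                            /* wrap like hardware shifts */
    unsigned k = 0, extra = 0;
    while (count >= 8) { count -= 8; k++; }   /* k = count / 8 by repeated subtraction */
    if (count) { k++; extra = 8 - count; }    /* round up: shifts out "extra" bits too many */

    uint8_t buf[24] = {0};                    /* zero padding on both sides of the value */
    memcpy(buf + 8, &x, 8);

    uint64_t hi, lo;
    memcpy(&hi, buf + 8 + k, 8);              /* x >> (8*k), zero-filled from the top */
    memcpy(&lo, buf + k, 8);                  /* reload 8 bytes lower: its MSB is the bit
                                                 just below the lowest bit of hi */
    for (unsigned i = 0; i < extra; i++) {    /* put the over-shifted bits back in, */
        hi += hi;                             /* one at a time, like the first loop */
        if ((int64_t)lo < 0)
            hi += 1;
        lo += lo;
    }
    return hi;                                /* == x >> (wrapped shift count) */
}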