How to perform right shift in Y86-64 (or other toy ISAs with ADD + AND but no native right shift)

Question

I'm trying to perform a right shift on Y86-64

To do a left shift I know that I need to multiply by 2^n where n is the number of bit shift we want for example if we want to shift by 4 it's 2^4 = 16 and do a loop of additions to perform multiplication on it but I'm not sure what to do for right shifts I think I need to perform division but not sure on how to approach that

pcount_do:
     movq $0, %rax
.L2: movq %rdi, %rdx
     shrq %rdi
     ret

The trivial way on the top of my head would be to actually copy bits (using `and`, `or` and a lookup table) from one register to another, shifted by the required amount. — Matteo Italia, Apr 05 '19 at 16:25
Think of it in terms of bits. 15 is 1111, right? What would that be if you shifted it one to the right? 0111 or 7. So shifting by one is the same as dividing by 2 (and dropping any remainder). What if we want to shift 15 by 2? That would be 0011 (or 3). That's like dividing by 4. — David Wohlferd, Apr 05 '19 at 17:33
I hadn't realized y86 was missing a right shift instruction. That's pretty terrible because it's hard to emulate. It doesn't have a division instruction either, so you probably need a loop and a lookup table like Matteo suggested. Are you sure you *need* a right shift? If you want to count bits, can't you shift them to the left side of the register (setting SF according to the MSB) with `add`? Oh, but Y86 doesn't have `js` / `jns`, so you'd compare against 0 and then `cmovl` or `jl` https://studylib.net/doc/8787646/y86-instruction-set-architecture-y86-arithmetic-and-logical — Peter Cordes, Apr 05 '19 at 21:06
@Peter Yea thats whats confusing me like i cant divide. Do you think a loop of subtractions could do the division?? — kidran, Apr 07 '19 at 21:02
Yes, division by repeated subtraction is always possible but horribly slow for large dividends. (Not a problem in "toy" programs on fast modern computers, though.) A computer with just a sub-and-branch instruction can be Turing-complete, but not *efficiently* programmable! There's a large difference between merely being able to compute something vs. being able to do it *efficiently*, if limited by a toy ISA that doesn't provide any key building blocks for certain basic operations that C has operators for, like `>>`, and which come up in real code for e.g. pointer subtraction. — Peter Cordes, Apr 08 '19 at 05:54
Given an O(1) or O(count) right shift, you can build an O(log n) division function (where `n` is the number of bits in the quotient, e.g. 64), or something like that. So it might be worth emulating a right shift in terms of a LUT for each byte separately, if you had to implement division that could handle inputs like `2^62 / 3` in less than say 12 years on a 4GHz computer (that can do 1 sub per clock cycle). Hardware division on a modern x86 does it in a few nanoseconds (Ryzen worst-case latency for `div r64` = 45 cycles = 4.5ns at 4GHz). Manually using right shift maybe 10x worse? — Peter Cordes, Apr 08 '19 at 06:01

Peter Cordes · Answer 1 · 2019-04-08T09:00:00.630

Like Matteo shows, you can loop one bit at a time, reading at one position and writing bits at another position.

Matteo's answer reads at a variable position by shifting a mask, and writing at a position that moves in lock-step, starting with the bottom of the register (shifting another mask).

It's easier to read the MSB of the input, then left-shift the input left-shift the input with add same,same and repeat. So we read bits starting with the top bit, and construct the result starting with its MSB. (We left shift 1 bit at a time into the destination with ADD to left shift, and a conditional add to set the new bit position or not.)

We can use a 2's complement signed comparison to read the top bit of a register. If it's set, x < 0, otherwise it's not.

x86 and y86 have a flag called SF that's set according to the MSB of the result (of an ALU operation). x86 has js / cmovs / sets instructions that check the SF condition directly. y86 only has jl / jge and other signed-compare conditions that check SF!=OF, so we need to do an extra compare against zero to clear OF (x - 0 can't overflow).

Or semantically, actually compare against zero instead of just reading SF. (Except we can optimize compare-against-zero into andl %eax,%eax or andq %rax,%rax, helpful if you're on a version of y86 that doesn't have sub-immediate. y86 is also missing x86's non-destructive test and cmp instructions that are like and and sub but only writing flags.)

32-bit y86 version tested on https://www.simn.me/js-y86/

Porting to y86-64 should be nearly trivial. (Change reg names, and 32 becomes 64).
Test case: 0x12345 >> 1 = 0x000091a2. (I don't see a way to permalink the code on that site, the way the Godbolt compiler explorer allows.)

   # constant input test case
    irmovl  0x12345, %eax
    #  irmovl  3, %ecx           # this could trivial handle variable counts, but doesn't.

# start of right-shift block:
# input: EAX = number to be shifted
# output: EDX =  number >> 1
# clobbers: EAX, ECX, EDI.   (EDI=1 to work around lack of add-immediate)

    xorl    %edx, %edx      # dst = 0.   like # irmovl  $0, %edx
    irmovl  1, %edi         # y86 is missing immediate add?

# shift 32-n bits from EAX into the bottom of EDX
# one at a time using SF to read them from the MSB
    irmovl  31, %ecx        # hard code count = 32 - 31
                            # or calculate this as 32 - count with neg / add or equivalent
rshift:                    # do {
    addl   %edx, %edx       # dst <<= 1

    andl   %eax, %eax       # compare against zero because y86 is missing js / cmovs that tests just SF
    jge   MSB_zero          # jge = jnl = not lower
    xorl    %edi,  %edx      # edx ^= 1.   y86 is missing OR?  low bit = 0 so we can ADD or XOR to set it
  MSB_zero:

    addl   %eax, %eax       # src <<= 1

    subl   %edi, %ecx
    jne   rshift            # }while(--ecx);  # semantically jnz


    halt # result in EDX
    #shr    $1, %eax

I used xor-zeroing because that y86 simulator assembles to variable-length machine code like x86. (So irmovl 0, %edx would be less efficient).

Or do the carry from MSB of EAX to LSB of EDX branchlessly with CMOVL

# loop body:
    addl       %edx, %edx      # dst <<= 1

    xorl       %esi, %esi      # esi = 0
    sub        %esi, %eax      # src -= 0  to set flags
    cmovl      %edi, %esi      # esi = (src<0) ? 1 : 0  = MSB of EAX
    addl       %esi, %edx      # copy the bit into EDX  (can't carry to higher bits)

    addl       %eax, %eax      # src <<= 1

If your y86 simulator simulates a performance penalty for branch mispredicts, use this. Otherwise branching is fewer instructions.

Or if you care about performance, it should be possible to use a lookup table for a whole byte at a time, with fixups across byte boundaries.

But with no left-shift to efficiently assemble the separate bytes, you'd need a separate 256-entry LUT of qwords for each byte-position! Or load from an offset and then mask off the "garbage" bytes.

Oh, and you need a right shift to extract bytes from a qword to feed the array indexing. If y86 can do byte loads, then you could store the input integer to memory and reload it 1 byte at a time. Or again emulate byte loads with unaligned qword loads and AND with 0x00...0FF to isolate that byte at the bottom of a register.

In fact, we can use store/reload with byte offsets and masking to "efficiently" do a right shift by a multiple of 8 bits in only a few instructions.

Oh crap, but we have a chicken/egg problem for runtime-variable counts. We need count / 8 as a byte offset, because there are 8 bits in a byte. But count is small, so we could use a repeated-subtraction loop for it. (You might want to AND with 0x3f or 0x1f (depending on operand-size) to wrap the count at 64 or 32, like x86 hardware shifts do. This will avoid indexing memory way outside the correct range with counts too large.)

Anyway, then you could extend this to handle right-shift counts that are not a multiple of 8 by rounding up (shifting out too many bits), then put the needed bits back in one at a time like the loop in the first part of the question. (After an unaligned load to get those bits at the top of a register.)

Or maybe use Matteo's method of walking a mask, using a LUT for start points. But if we're already doing a store/unaligned reload for byte shifting, another reload is probably good. We can calculate the right offset for this relative to the first unaligned reload, 4 or 8 bytes before so the starting MSB is the bit just below the lowest bit of the first load.

score 1 · Answer 2 · edited Aug 12 '19 at 06:28

1

Given that the instruction set of Y86 misses shifts and divisions, I'd go for something equivalent to this C code:

uint64_t lut[] = {
    1,
    2,
    4,
    8,
    16,
    32,
    64,
    128,
    256,
    512,
    1024,
    2048,
    4096,
    8192,
    16384,
    32768,
    65536,
    131072,
    262144,
    524288,
    1048576,
    2097152,
    4194304,
    8388608,
    16777216,
    33554432,
    67108864,
    134217728,
    268435456,
    536870912,
    1073741824,
    2147483648,
    4294967296,
    8589934592,
    17179869184,
    34359738368,
    68719476736,
    137438953472,
    274877906944,
    549755813888,
    1099511627776,
    2199023255552,
    4398046511104,
    8796093022208,
    17592186044416,
    35184372088832,
    70368744177664,
    140737488355328,
    281474976710656,
    562949953421312,
    1125899906842624,
    2251799813685248,
    4503599627370496,
    9007199254740992,
    18014398509481984,
    36028797018963968,
    72057594037927936,
    144115188075855872,
    288230376151711744,
    576460752303423488,
    1152921504606846976,
    2305843009213693952,
    4611686018427387904,
    9223372036854775808};

uint64_t rshift(uint64_t source, int amount) {
    uint64_t result = 0;
    for(int i = amount; i < 64; ++i) {
        if(source & lut[i]) result |= lut[i-amount];
    }
    return result;
}

This should all be doable with just add/sub/and/or plus a lookup table.

If we want to be smarter, as @PeterCordes suggests we can use an 8 entries lookup table and work on whole bytes, but that requires significantly more bookkeeping than looping over each bit.

--- update ---

@PeterCordes correctly notes that the lookup table is actually useless, as I'm looping over bits, so calculating the next power of two using a sum is trivial:

uint64_t rshift(uint64_t source, int amount) {
    uint64_t result = 0;
    uint64_t read_bit = 1;
    uint64_t write_bit = 1;
    for(int i = 0; i < amount; ++i) read_bit = read_bit + read_bit;
    for(int i = amount; i < 64; ++i) {
        if(source & read_bit) result |= write_bit;
        read_bit = read_bit + read_bit;
        write_bit = write_bit + write_bit;
    }
    return result;
}

edited Aug 12 '19 at 06:28

phuclv

37,963
15
156
475

answered Apr 08 '19 at 07:27

Matteo Italia

123,740
17
206
299

You don't actually need this LUT of powers of 2 since you're looping over bits. For shift by 1, you can start with `2` and `1` in separate registers, and walk those offset masks left 1 at a time (with add same,same=left shift). For any assemble-time-constant shift count, you would can merely calculate the starting read and write masks and loop bound. – Peter Cordes Apr 08 '19 at 07:35
Uh right, this was definitely stupid of me. TBH it's a bit difficult to reason about this stuff when you are missing the most basic building block of bit manipulation! – Matteo Italia Apr 08 '19 at 07:36
Yeah, I know! You never need this crap on a CPU that anyone would use for any real use. Y86 isn't too terrible compared to some toy ISAs like MARIE or LMC that don't even have bitwise operations like AND, just integer add. You can always shift any bit to the top (with an add same,same loop) where a signed-compare can check it. But anyway, forcing students to solve problems that are easy on real computers just seems silly. Asm makes it obvious why binary and bit-manipulation is important, so a toy ISA without those is just crazy. – Peter Cordes Apr 08 '19 at 07:41
Anyway, I had been thinking along the lines of using shifting bits to the left in the input operand, and shift into a fresh reg for 32-count iterations. You can test the MSB using a signed-compare, or if the machine has a carry flag and ADD/ADC that's even better. Now that I think about it in enough detail, that's very similar to your algorithm, but you shift a read-mask instead of shifting bits to the top. – Peter Cordes Apr 08 '19 at 07:44
1

@PeterCordes: yes, it's really stupid to devise such an unrealistic ISA, as the world _doesn't work like that!_ If you are aiming for an exercise in minimalism you may remove arithmetic instructions, that are actually built on logic ports, but that's completely unrealistic as well - even just addressing _requires_ arithmetic. When asked for this kind of stuff, typically my reply [is on the sarcastic side](https://stackoverflow.com/a/11694828/214671). – Matteo Italia Apr 08 '19 at 08:01
1

A minimal CPU (one-instruction ISA) implemented on a carbon nanotube was used as a proof-of-concept ([What is the minimum instruction set required for any Assembly language to be considered useful?](//stackoverflow.com/a/19677755)), but if you're going to have multiple instructions at all, yeah it doesn't make sense to leave out stuff that's trivial in hardware. – Peter Cordes Apr 08 '19 at 08:17

How to perform right shift in Y86-64 (or other toy ISAs with ADD + AND but no native right shift)

2 Answers2

32-bit y86 version tested on https://www.simn.me/js-y86/

In fact, we can use store/reload with byte offsets and masking to "efficiently" do a right shift by a multiple of 8 bits in only a few instructions.