Can "swap nybble" and "byte mask" tricks do multi byte logical shift by 4 much faster than the naive method of using bit shift chains

Question

I'm writing a fixed-point (16Q16) algo that does division using the Newton–Raphson method outlined on Wikipedia. (Related SE question HERE.)

The first step requires logical right-shift by anywhere from 1-16 bits. My CPU is an 8-bit microcontroller, so does not have a barrel shifter; the hardware can only support shifting by 1 bit. (Related SE question HERE.) To shift by n bits, one can use the single shift instruction n times, however, this has significant worst case timing. This bad worst case becomes abysmal when compounded by the multi-byte nature of the shift. If it takes 1 cycle to shift 1 byte 1 time, we can quickly imagine the problem when needing to shift 16 bytes 16 times. Remember, this is just step 1.

The obvious optimization is to divide and conquer; manually compute for all powers of 2. The first, trivial case is shifts by 16 and 8, since this is just changing the memory index to the fixed-point number. Doing this it only takes ~4 cycles/instructions to shift 16 bytes by 16 bits, a 64x speed improvement over "single shift chaining."

The problem I'm having is pesky shifts by 4.

My intuition, as well as a few snippets and posts here and there, tells me that their exists an efficient way to combine nybble swap instructions with bit-masks and logical ops to create a sort of "nybble swap zipper effect" that could optimally shift an arbitrary length array by 4 bits. (Related SE question HERE. Links to answer 3)

I KNOW it can at least fundamentally be done, because I have a prof of concept using only SWAP, XOR, and AND instructions. I also know it can be done FASTER because what I have is one cycle faster (lol, yes, one) than a simple shift chain method. (see code box below) What I do not know is...

Can this be done with a time complexity much closer to one cycle per byte than to one cycle per bit using any kind of nybble swap byte mask trick?

Note: This is PIC18 ASM, but it's pretty obvious what's going on. Any suggestions on how this can be majorly improved would be an answer to this question. YES! I realize that it may very well be close to optimal already, but realize shifting by 4 is a recurring hot spot in a few pieces of code. I'm expecting to cut at least one instruction from each block. Cutting out two would be amazing.

; Shift the denominator -> 4 (19 cyc)
;------------------------------------------
        SWAPF   Denom+3, W, BANKED
        MOVWF   Denom+3, BANKED
        ANDLW   0xF0
        XORWF   Denom+3, F, BANKED

        SWAPF   Denom+2, F, BANKED
        XORWF   Denom+2, F, BANKED
        XORWF   Denom+2, W, BANKED
        ANDLW   0xF0
        XORWF   Denom+2, F, BANKED

        SWAPF   Denom+1, F, BANKED
        XORWF   Denom+1, F, BANKED
        XORWF   Denom+1, W, BANKED
        ANDLW   0xF0
        XORWF   Denom+1, F, BANKED

        SWAPF   Denom+0, F, BANKED
        XORWF   Denom+0, F, BANKED
        XORWF   Denom+0, W, BANKED
        ANDLW   0xF0
        XORWF   Denom+0, F, BANKED
;------------------------------------------

The `movwf` command doesn't take a F/W destination, it always moves W into a register. Could you clarify what you are trying to do in the second line? — David Grayson, Feb 21 '21 at 04:12
"The movwf command doesn't take a F/W destination," I have no Idea why I ended up doing that. Point of fact, it wasn't like that when I first posted and I guess I edited it to how it is now?! It's been edited back to what it was. — Charlie, Feb 21 '21 at 09:05

score 3 · Answer 1 · answered Feb 21 '21 at 18:11

Your solution involves 19 instructions (which each take one cycle) but the simple solution just takes 18:

RRCF   Denom+3, F, BANKED
RRCF   Denom+2, F, BANKED
RRCF   Denom+1, F, BANKED
RRCF   Denom+0, F, BANKED

RRCF   Denom+3, F, BANKED
RRCF   Denom+2, F, BANKED
RRCF   Denom+1, F, BANKED
RRCF   Denom+0, F, BANKED

RRCF   Denom+3, F, BANKED
RRCF   Denom+2, F, BANKED
RRCF   Denom+1, F, BANKED
RRCF   Denom+0, F, BANKED

RRCF   Denom+3, F, BANKED
RRCF   Denom+2, F, BANKED
RRCF   Denom+1, F, BANKED
RRCF   Denom+0, F, BANKED

MOVLW  0x0F
ANDWF  Denom+3, F, BANKED

I can't think of anything faster than that which actually performs this shift.

Interesting. So maybe swap shift isn't going to be faster after all.... (:/) I'm not at my code right now, so I can't see what I did at first, but I think I was clearing the carry flag each loop, which added one instruction. I was getting 20 cycles this way. Your way of using a shift chain is faster though because it ignores what was in the carry, choosing to deal with all those bits after the shifts. — Charlie, Feb 21 '21 at 22:44

Can "swap nybble" and "byte mask" tricks do multi byte logical shift by 4 much faster than the naive method of using bit shift chains

Can this be done with a time complexity much closer to one cycle per byte than to one cycle per bit using any kind of nybble swap byte mask trick?

1 Answers1