I'm writing a fixed-point (16Q16) algo that does division using the Newton–Raphson method outlined on Wikipedia. (Related SE question HERE.)
The first step requires logical right-shift by anywhere from 1-16 bits. My CPU is an 8-bit microcontroller, so does not have a barrel shifter; the hardware can only support shifting by 1 bit. (Related SE question HERE.) To shift by n
bits, one can use the single shift instruction n
times, however, this has significant worst case timing. This bad worst case becomes abysmal when compounded by the multi-byte nature of the shift. If it takes 1 cycle to shift 1 byte 1 time, we can quickly imagine the problem when needing to shift 16 bytes 16 times. Remember, this is just step 1.
The obvious optimization is to divide and conquer; manually compute for all powers of 2. The first, trivial case is shifts by 16 and 8, since this is just changing the memory index to the fixed-point number. Doing this it only takes ~4 cycles/instructions to shift 16 bytes by 16 bits, a 64x speed improvement over "single shift chaining."
The problem I'm having is pesky shifts by 4.
My intuition, as well as a few snippets and posts here and there, tells me that their exists an efficient way to combine nybble swap instructions with bit-masks and logical ops to create a sort of "nybble swap zipper effect" that could optimally shift an arbitrary length array by 4 bits. (Related SE question HERE. Links to answer 3)
I KNOW it can at least fundamentally be done, because I have a prof of concept using only SWAP, XOR, and AND instructions. I also know it can be done FASTER because what I have is one cycle faster (lol, yes, one) than a simple shift chain method. (see code box below) What I do not know is...
Can this be done with a time complexity much closer to one cycle per byte than to one cycle per bit using any kind of nybble swap byte mask trick?
Note: This is PIC18 ASM, but it's pretty obvious what's going on. Any suggestions on how this can be majorly improved would be an answer to this question. YES! I realize that it may very well be close to optimal already, but realize shifting by 4 is a recurring hot spot in a few pieces of code. I'm expecting to cut at least one instruction from each block. Cutting out two would be amazing.
; Shift the denominator -> 4 (19 cyc)
;------------------------------------------
SWAPF Denom+3, W, BANKED
MOVWF Denom+3, BANKED
ANDLW 0xF0
XORWF Denom+3, F, BANKED
SWAPF Denom+2, F, BANKED
XORWF Denom+2, F, BANKED
XORWF Denom+2, W, BANKED
ANDLW 0xF0
XORWF Denom+2, F, BANKED
SWAPF Denom+1, F, BANKED
XORWF Denom+1, F, BANKED
XORWF Denom+1, W, BANKED
ANDLW 0xF0
XORWF Denom+1, F, BANKED
SWAPF Denom+0, F, BANKED
XORWF Denom+0, F, BANKED
XORWF Denom+0, W, BANKED
ANDLW 0xF0
XORWF Denom+0, F, BANKED
;------------------------------------------