A 4-bit (16 entry) lookup table for the two halves of the byte looks like a good tradeoff (as @Aki points out in a comment).
AVR instructions are 2 bytes each, so a 16-byte table costs the same space as 8 instructions. (It turns out not to be worth it for speed or size, except maybe if you can align your array by 256 bytes to make indexing much cheaper than what gcc does.)
It would be possible to pack the LUT, using the high and low half of each byte. But that would cost more in indexing (using a branch on bit 4 of index to conditionally SWAP before masking) than it saves in table size (8 bytes = 4 instructions).
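To illustrate the packed idea, here is a rough sketch (the table contents and the helper name are mine, not from any answer; the top bit of the 4-bit index selects which half of the byte holds the result):

#include <stdint.h>
// Packed 8-byte table: entry i holds the bit-reverse of i in its low nibble
// and the bit-reverse of i+8 in its high nibble.
static const uint8_t reverse_nibble_packed[8] = {
    0x10, 0x98, 0x54, 0xDC, 0x32, 0xBA, 0x76, 0xFE
};

static uint8_t reverse_nibble_packed_lookup (uint8_t nib)  // nib = 0..15
{
    uint8_t b = reverse_nibble_packed[nib & 0x07];
    if (nib & 0x08)      // high half of the index range:
        b >>= 4;         // the compiler can do this with SWAP before the mask
    return b & 0x0f;
}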
Let's compare what AVR GCC does. gcc4.6 has surprisingly terrible missed-optimizations when compiling Brett's code (actually promoting to int and not taking full advantage of truncating the result to uint8_t). Ironically, it does optimize x<<4 | x>>4 into a rotate using the SWAP instruction. (AVR doesn't have a multiple-count rotate instruction, and normal rotates are rotate-through-carry. Shifts are only available with one count per instruction.)
#include <stdint.h>
uint8_t reverse_byte_alu (uint8_t x)
{
uint8_t xeven = x & 0x55, xodd = x & 0xaa;
x = (xeven << 1) | (xodd >> 1); // swap adjacent bit pairs
xeven = x & 0x33, xodd = x & 0xcc;
x = (xeven << 2) | (xodd >> 2); // swap adjacent 2-bit chunks
x = ((x << 4) | (x >> 4)); // 4-bit rotate is recognized as SWAP
return x;
}
compiles into this asm with gcc4.6 -O3 (Godbolt compiler explorer). I don't see any missed-optimizations.
reverse_byte_alu:
mov r25,r24
andi r25,lo8(85)
lsl r25
andi r24,lo8(-86)
lsr r24
or r25,r24 # swap of adjacent bits done
mov r24,r25
andi r24,lo8(51)
lsl r24
lsl r24 # << 2
andi r25,lo8(-52)
lsr r25
lsr r25 # >> 2
or r24,r25 # swap pairs of bits done
swap r24 # swap nibbles
ret
16 instructions, 2 bytes each, all of them 1-cycle (except ret). Instruction set reference: https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_instruction_list.html
So this is slightly better than @JimmyB's answer, which needs 16 single-cycle instructions not including a ret. (But it can be rolled up into a small loop.)
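For reference, the rolled-up version looks something like this in C (just a sketch of the general bit-at-a-time shape, not @JimmyB's exact code):

uint8_t reverse_byte_loop (uint8_t x)
{
    uint8_t result = 0;
    for (uint8_t i = 0; i < 8; i++) {
        result = (uint8_t)((result << 1) | (x & 1));  // shift the low bit of x into result
        x >>= 1;
    }
    return result;
}

That trades the 16 straight-line instructions for a much smaller body plus loop overhead run eight times, so it should be smaller but slower.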
Array indexing is not cheap in AVR. The only choice of addressing mode is post-increment or pre-decrement, no immediate displacements. So the 16-bit array address has to be added to the 4-bit nibble values. If your array is in the low 256 bytes of address space, this could be cheaper. Or if your array is 256-byte aligned, you could just set the upper byte of a pointer register and put your lookup value in the low byte. (gcc misses this optimization).
Telling gcc to align the array on a 16-byte boundary makes the address calculation cheaper, but the total instruction count is 17, and some of them are multi-cycle instructions.
__attribute__((aligned(16))) // makes indexing cheaper
static const uint8_t reverse_nibble[16] = {
0, 0b1000, 0b0100, 0b1100,
0b0010, 0b1010, 0b0110, 0b1110,
0b0001, 0b1001, 0b0101, 0b1101,
0b0011, 0b1011, 0b0111, 0b1111
};
uint8_t reverse_byte_nibble_LUT (uint8_t x) {
uint8_t hi = reverse_nibble[x>>4];
hi = ((hi << 4) | (hi >> 4)); // SWAP instead of SWAP+AND for just a shift
uint8_t lo = reverse_nibble[x & 0x0f];
return hi | lo;
}
compiles to 17 instructions, and LD is a 2-cycle instruction when accessing FLASH with a non-increment/decrement addressing mode. (It's 1 cycle on some CPUs when not accessing flash or internal SRAM).
# gcc4.6 output, not optimal
mov r26,r24
swap r26
andi r26,lo8(15)
ldi r27,lo8(0)
subi r26,lo8(-(reverse_nibble)) # AVR doesn't have add-immediate byte, only sub
sbci r27,hi8(-(reverse_nibble)) # X register = (x>>4) - (-LUT_base)
ld r25,X
swap r25
mov r30,r24
ldi r31,lo8(0)
andi r30,lo8(15)
andi r31,hi8(15) # missed opt, this is a double-redundant 0 & 0
subi r30,lo8(-(reverse_nibble))
sbci r31,hi8(-(reverse_nibble))
ld r24,Z
or r24,r25
ret
ldi r27,0 / sbci r27 is probably a missed optimization. With a 16-byte aligned table, we can't get carry into the high byte, so I think we can do:
# generate r30 = x&0x0f
subi r30,lo8(-(reverse_nibble)) # ORI would work, too. no-op with 256-byte aligned table
ldi r31,hi8(reverse_nibble) # reuse this for hi and lo
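Going further, if the table were 256-byte aligned (so its low address byte is zero), the whole lookup could be hand-written as something like this sketch (my guess, untested; it assumes the table sits in data space so plain LD works):

ldi r31,hi8(reverse_nibble)   # Z high byte = table page; low address byte is 0
mov r30,r24
swap r30
andi r30,lo8(15)              # Z low byte = x>>4
ld r25,Z                      # reversed high nibble (still in the low 4 bits)
swap r25                      # move it into the high-nibble position
mov r30,r24
andi r30,lo8(15)              # Z low byte = x & 0x0f; r31 is reused
ld r24,Z                      # reversed low nibble
or r24,r25
ret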
So this might come out ahead on speed with better indexing, but total size (code + table) definitely looks worse.