A 4-bit (16 entry) lookup table for the two halves of the byte looks like a good tradeoff (as @Aki points out in a comment).
AVR instructions are 2 bytes each, so a 16-byte table costs the same space as 8 instructions. (It turns out not to be worth it for speed or size, except maybe if you can align your array by 256 bytes to make indexing much cheaper than what gcc does.)
It would be possible to pack the LUT, using the high and low half of each byte. But that would cost more in indexing (using a branch on bit 4 of index to conditionally SWAP before masking) than it saves in table size (8 bytes = 4 instructions).
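To illustrate the packed idea, here is a rough sketch (the table contents and the helper name are mine, not from any answer; the top bit of the 4-bit index selects which half of the byte holds the result):

#include <stdint.h>
// Packed 8-byte table: entry i holds the bit-reverse of i in its low nibble
// and the bit-reverse of i+8 in its high nibble.
static const uint8_t reverse_nibble_packed[8] = {
    0x10, 0x98, 0x54, 0xDC, 0x32, 0xBA, 0x76, 0xFE
};

static uint8_t reverse_nibble_packed_lookup (uint8_t nib)  // nib = 0..15
{
    uint8_t b = reverse_nibble_packed[nib & 0x07];
    if (nib & 0x08)      // high half of the index range:
        b >>= 4;         // the compiler can do this with SWAP before the mask
    return b & 0x0f;
}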
Let's compare what AVR GCC does. gcc4.6 has surprisingly terrible missed-optimizations when compiling Brett's code (actually promoting to int and not taking full advantage of truncating the result to uint8_t). Ironically, it does optimize x<<4 | x>>4 into a rotate using the SWAP instruction. (AVR doesn't have a multiple-count rotate instruction, and normal rotates are rotate-through-carry. Shifts are only available with one count per instruction.)
#include <stdint.h>
uint8_t reverse_byte_alu (uint8_t x)
{
uint8_t xeven = x & 0x55, xodd = x & 0xaa;
x = (xeven << 1) | (xodd >> 1); // swap adjacent bit pairs
xeven = x & 0x33, xodd = x & 0xcc;
x = (xeven << 2) | (xodd >> 2); // swap adjacent 2-bit chunks
x = ((x << 4) | (x >> 4)); // 4-bit rotate is recognized as SWAP
return x;
}
compiles into this asm with gcc4.6 -O3 (Godbolt compiler explorer). I don't see any missed-optimizations.
reverse_byte_alu:
mov r25,r24
andi r25,lo8(85)
lsl r25
andi r24,lo8(-86)
lsr r24
or r25,r24 # swap of adjacent bits done
mov r24,r25
andi r24,lo8(51)
lsl r24
lsl r24 # << 2
andi r25,lo8(-52)
lsr r25
lsr r25 # >> 2
or r24,r25 # swap pairs of bits done
swap r24 # swap nibbles
ret
16 instructions, 2 bytes each, all of them 1-cycle (except ret). Instruction set reference: https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_instruction_list.html
So this is slightly better than @JimmyB's answer, which needs 16 single-cycle instructions not including a ret. (But it can be rolled up into a small loop.)
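For reference, the rolled-up version looks something like this in C (just a sketch of the general bit-at-a-time shape, not @JimmyB's exact code):

uint8_t reverse_byte_loop (uint8_t x)
{
    uint8_t result = 0;
    for (uint8_t i = 0; i < 8; i++) {
        result = (uint8_t)((result << 1) | (x & 1));  // shift the low bit of x into result
        x >>= 1;
    }
    return result;
}

That trades the 16 straight-line instructions for a much smaller body plus loop overhead run eight times, so it should be smaller but slower.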
Array indexing is not cheap in AVR. The only choice of addressing mode is post-increment or pre-decrement, no immediate displacements. So the 16-bit array address has to be added to the 4-bit nibble values. If your array is in the low 256 bytes of address space, this could be cheaper. Or if your array is 256-byte aligned, you could just set the upper byte of a pointer register and put your lookup value in the low byte. (gcc misses this optimization).
Telling gcc to align the array on a 16-byte boundary makes the address calculation cheaper, but the total instruction count is 17, and some of them are multi-cycle instructions.
__attribute__((aligned(16))) // makes indexing cheaper
static const uint8_t reverse_nibble[16] = {
0, 0b1000, 0b0100, 0b1100,
0b0010, 0b1010, 0b0110, 0b1110,
0b0001, 0b1001, 0b0101, 0b1101,
0b0011, 0b1011, 0b0111, 0b1111
};
uint8_t reverse_byte_nibble_LUT (uint8_t x) {
uint8_t hi = reverse_nibble[x>>4];
hi = ((hi << 4) | (hi >> 4)); // SWAP instead of SWAP+AND for just a shift
uint8_t lo = reverse_nibble[x & 0x0f];
return hi | lo;
}
compiles to 17 instructions, and LD is a 2-cycle instruction when accessing FLASH with a non-increment/decrement addressing mode. (It's 1 cycle on some CPUs when not accessing flash or internal SRAM).
# gcc4.6 output, not optimal
mov r26,r24
swap r26
andi r26,lo8(15)
ldi r27,lo8(0)
subi r26,lo8(-(reverse_nibble)) # AVR doesn't have add-immediate byte, only sub
sbci r27,hi8(-(reverse_nibble)) # X register = (x>>4) - (-LUT_base)
ld r25,X
swap r25
mov r30,r24
ldi r31,lo8(0)
andi r30,lo8(15)
andi r31,hi8(15) # missed opt, this is a double-redundant 0 & 0
subi r30,lo8(-(reverse_nibble))
sbci r31,hi8(-(reverse_nibble))
ld r24,Z
or r24,r25
ret
ldi r27,0 / sbci r27 is probably a missed optimization. With a 16-byte aligned table, we can't get carry into the high byte, so I think we can do:
# generate r30 = x&0x0f
subi r30,lo8(-(reverse_nibble)) # ORI would work, too. no-op with 256-byte aligned table
ldi r31,hi8(reverse_nibble) # reuse this for hi and lo
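Going further, if the table were 256-byte aligned (so its low address byte is zero), the whole lookup could be hand-written as something like this sketch (my guess, untested; it assumes the table sits in data space so plain LD works):

ldi r31,hi8(reverse_nibble)   # Z high byte = table page; low address byte is 0
mov r30,r24
swap r30
andi r30,lo8(15)              # Z low byte = x>>4
ld r25,Z                      # reversed high nibble (still in the low 4 bits)
swap r25                      # move it into the high-nibble position
mov r30,r24
andi r30,lo8(15)              # Z low byte = x & 0x0f; r31 is reused
ld r24,Z                      # reversed low nibble
or r24,r25
ret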
So this might come out ahead on speed with better indexing, but total size (code + table) definitely looks worse.