Your word description of the process is correct, but your pseudo-code illustration is inaccurate and incomplete. You need to copy the LSB of `counter` before you clear it; otherwise you'll have lost the bits you need to reverse. You need to clear the LSB correctly, and you can then reverse the LSB bits directly back into the counter LSB as follows:
```c
// Copy counter LSB
uint8_t lsb = (uint8_t)(counter & 0xFFu) ;

// Clear counter LSB
counter &= 0xff00u ;

// Reverse LSB bits and mask into counter LSB
for( uint8_t mask = 0x80u;
     mask != 0;
     lsb >>= 1, mask >>= 1 )
{
    counter |= ((lsb & 0x01u) != 0) ? mask : 0 ;
}
```
You should also use the `stdint.h` types `uint16_t` and `uint8_t` for this operation rather than relying on `int` being any particular size - it will make the code more portable and testable on a system where `int` is not 16 bits. And generally you should use unsigned types when performing bit-wise operations.
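As a minimal sketch of that point, the loop version can be wrapped in a function and exercised on a desktop host where `int` is 32 bits - the function name `reverse_counter_lsb` and the test values here are my own, not anything from your code:

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical wrapper (the name is mine) so the snippet can be
// tested on a host where int is wider than 16 bits.
static uint16_t reverse_counter_lsb( uint16_t counter )
{
    // Copy counter LSB
    uint8_t lsb = (uint8_t)(counter & 0xFFu) ;

    // Clear counter LSB
    counter &= 0xff00u ;

    // Reverse LSB bits and mask into counter LSB
    for( uint8_t mask = 0x80u; mask != 0; lsb >>= 1, mask >>= 1 )
    {
        counter |= ((lsb & 0x01u) != 0) ? mask : 0 ;
    }

    return counter ;
}

int main( void )
{
    printf( "%04X\n", reverse_counter_lsb( 0xAB01u ) ) ;  // 0x01 -> 0x80, prints AB80
    printf( "%04X\n", reverse_counter_lsb( 0x12F0u ) ) ;  // 0xF0 -> 0x0F, prints 120F
    return 0 ;
}
```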
A somewhat faster method, though possibly requiring a little more ROM space, is to use a look-up table. A 256-byte look-up table is rather cumbersome to generate and on an ATtiny rather prohibitive in terms of memory usage. Instead it can be done almost as efficiently using a 16-byte look-up as follows:
```c
// Copy counter LSB
uint8_t lsb = (uint8_t)(counter & 0xFFu) ;

// Clear counter LSB
counter &= 0xff00u ;

// Reversed-nibble look-up table
static const uint8_t lookup[] = { 0x0, 0x8, 0x4, 0xC,
                                  0x2, 0xA, 0x6, 0xE,
                                  0x1, 0x9, 0x5, 0xD,
                                  0x3, 0xB, 0x7, 0xF } ;

// Reverse each nibble and swap the nibbles into the counter LSB
counter |= lookup[lsb & 0xf] << 4 | lookup[lsb >> 4] ;
```
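For example, with `lsb` equal to 0x2D the low nibble 0xD reverses to 0xB and the high nibble 0x2 reverses to 0x4, so the expression yields 0xB4, which is 0x2D with its bits reversed.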
You could even pack the look-up table and use just 8 bytes (0x80, 0xC4 etc.):
```c
// Packed look-up table: two reversed nibbles per byte
static const uint8_t lookup[] = { 0x80, 0xC4,
                                  0xA2, 0xE6,
                                  0x91, 0xD5,
                                  0xB3, 0xF7 } ;

// Reversed low nibble of lsb (becomes the high nibble of the result)
uint8_t msnib = ( lsb & 0x01 ) ? lookup[(lsb & 0xf) >> 1] >> 4 :
                                 lookup[(lsb & 0xf) >> 1] & 0xf ;

// Reversed high nibble of lsb (becomes the low nibble of the result)
uint8_t lsnib = ( lsb & 0x10 ) ? lookup[(lsb & 0xf0) >> 5] >> 4 :
                                 lookup[(lsb & 0xf0) >> 5] & 0xf ;

counter |= (lsnib | msnib << 4) ;
```
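In case the packing is not obvious: each byte of the 8-byte table holds two entries of the 16-byte table, the even-indexed entry in the low nibble and the odd-indexed entry in the high nibble. A sketch of how it could be generated (the array names here are mine):

```c
// Sketch: derive the packed 8-byte table from the 16-entry one,
// e.g. packed[0] = (0x8 << 4) | 0x0 = 0x80, packed[1] = (0xC << 4) | 0x4 = 0xC4
static const uint8_t lookup16[] = { 0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
                                    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF } ;
uint8_t packed[8] ;

for( uint8_t i = 0; i < 8; i++ )
{
    packed[i] = (uint8_t)( (lookup16[2 * i + 1] << 4) | lookup16[2 * i] ) ;
}
// packed[] now holds { 0x80, 0xC4, 0xA2, 0xE6, 0x91, 0xD5, 0xB3, 0xF7 }
```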
But the reduction in look-up table size is not likely to be justified by the increase in code size for the resulting additional bit manipulation - and it's just a bit "too clever" - it took me a while to get it right!
The first method has the advantage that it can be applied to an arbitrary number of bits. Both look-up table solutions can be extended to any word size that is a multiple of 4 bits without changing the look-up table size, so they scale well.
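For example, a minimal sketch of the same 16-byte table reversing a full 32-bit word, nibble by nibble (the function name `reverse32` is mine, not anything from your code):

```c
#include <stdint.h>

// Sketch: full 32-bit bit reversal using the unchanged 16-byte table.
static uint32_t reverse32( uint32_t value )
{
    static const uint8_t lookup[] = { 0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
                                      0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF } ;
    uint32_t reversed = 0 ;

    // Take nibbles from the low end of value and append their reversals
    // to the low end of the result - eight nibbles for 32 bits.
    for( uint8_t nibble = 0; nibble < 8; nibble++ )
    {
        reversed = (reversed << 4) | lookup[value & 0xFu] ;
        value >>= 4 ;
    }

    return reversed ;
}
```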
Benchmarking
I tested each implementation at https://godbolt.org/ set to AVR GCC 4.6.4 using three different optimisation settings. The instruction count excludes function entry/exit code added to make it compilable, and represents just the instructions generated from the source code in this answer.
| Algorithm | Instructions (No Opt) | Instructions (-O3) | Instructions (-Os) | Data (bytes) |
|-----------|:---------------------:|:------------------:|:------------------:|:------------:|
| Loop      | 38                    | 88                 | 23                 | 0            |
| LookUp16  | 59                    | 38                 | 37                 | 16           |
| LookUp8   | 137                   | 65                 | 62                 | 8            |
The test says little about execution time, but if code size is critical the loop algorithm with space optimisation (`-Os`) is probably the best choice.
The look-up table is no doubt faster regardless of optimisation level, and the 16-byte look-up table with either optimisation may be a reasonable balance. For `-O3` it is, overall, smaller and faster than the 88-instruction unrolled loop. It also has the distinct advantage that the code size is far less variable with optimisation settings, which can minimise surprises when switching between debug and release builds.
The 8-byte look-up perhaps has little merit other than being quite interesting.