13

I need to rewrite about 4KB of data in reverse order, at bit level (last bit of last byte becoming first bit of first byte), as fast as possible. Are there any clever sniplets to do it?

Rationale: The data is display contents of LCD screen in an embedded device that is usually positioned in a way that the screen is on your shoulders level. The screen has "6 o'clock" orientation, that is to be viewed from below - like lying flat or hanging above your eyes level. This is fixable by rotating the screen 180 degrees, but then I need to reverse the screen data (generated by library), which is 1 bit = 1 pixel, starting with upper left of the screen. The CPU isn't very powerful, and the device has enough work already, plus several frames a second would be desirable so performance is an issue; RAM not so much.

edit: Single core, ARM 9 series. 64MB, (to be scaled down to 32MB later), Linux. The data is pushed from system memory to the LCD driver over 8-bit IO port.

The CPU is 32bit and performs much better at this word size than at byte level.

SF.
  • 13,549
  • 14
  • 71
  • 107
  • I've found that the lcd display sites typically have good routines for writing data to their modules (and in both directions). Also, as this is a common issue, hobbyist (and professional) boards for your processor probably also have algorithms to do this. – KevinDTimm Jan 28 '10 at 10:20
  • 2
    One thing, you could re-arrange the bits inside a byte by using a lookup-table, e.g. `00101101 -> 10110100`. The rest of the problem would then be reduced to quickly reverse the buffer at the byte level. – stakx - no longer contributing Jan 28 '10 at 10:20
  • You should really do it on-the-fly when sending the data to the LCD, not as a separate pre-pass. See [my answer](http://stackoverflow.com/a/16535315/414813) below... – CAFxX May 14 '13 at 10:38

9 Answers9

23

There's a classic way to do this. Let's say unsigned int is your 32-bit word. I'm using C99 because the restrict keyword lets the compiler perform extra optimizations in this speed-critical code that would otherwise be unavailable. These keywords inform the compiler that "src" and "dest" do not overlap. This also assumes you are copying an integral number of words, if you're not, then this is just a start.

I also don't know which bit shifting / rotation primitives are fast on the ARM and which are slow. This is something to consider. If you need more speed, consider disassembling the output from the C compiler and going from there. If using GCC, try O2, O3, and Os to see which one is fastest. You might reduce stalls in the pipeline by doing two words at the same time.

This uses 23 operations per word, not counting load and store. However, these 23 operations are all very fast and none of them access memory. I don't know if a lookup table would be faster or not.

void
copy_rev(unsigned int *restrict dest,
         unsigned int const *restrict src,
         unsigned int n)
{
    unsigned int i, x;
    for (i = 0; i < n; ++i) {
        x = src[i];
        x = (x >> 16) | (x << 16);
        x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);
        x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
        x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
        x = ((x >> 1) & 0x55555555U) | ((x & 0x555555555) << 1);
        dest[n-1-i] = x;
    }
}

This page is a great reference: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious

Final note: Looking at the ARM assembly reference, there is a "REV" opcode which reverses the byte order in a word. This would shave 7 operations per loop off the above code.

Dietrich Epp
  • 205,541
  • 37
  • 345
  • 415
  • On my commodity Hazwell box, NOT the intended target to be sure, but what I have, this is about 3X as fast as my very literal interpretation of a solution I provided below. A conditional lookup table would probably be the fastest solution, especially if the bytes were swapped only if they were not identical so that actually doing the swap would be entirely redundant. –  Feb 12 '15 at 20:31
  • I'm skeptical about both lookup tables and the idea of somehow not swapping bytes that are identical. Conditional moves introduce a *rather enormous* amount of overhead, and pipelines have reduced the effectiveness of tables. On the other hand, when I compile the code above, it gets turned into SSE2 instructions automatically by the compiler (Apple LLVM 6.0), using 54 instructions in the inner loop. Performance numbers: 2100 MB/s for `memcpy()`, 630 MB/s for the above code, and 370 MB/s for a lookup table. (Apple LLVM 6.0, `-O3`, i5-4258U CPU @ 2.40GHz, 100M integers x 10 runs) – Dietrich Epp Feb 12 '15 at 22:54
  • @RocketRoy: Here is the code: https://gist.github.com/depp/00425610e265c93d5c2c — my numbers were off by a factor of 4 because I forgot to multiply by `sizeof(unsigned)` when reporting speed. – Dietrich Epp Feb 12 '15 at 22:59
  • Agree with all said Dietrich, but too close to call, so have to benchmark. I recently wrote something here at SO for Circular Identities, and found an inner-loop conditional branch was cheaper than int mult by zero. Surprising. On the other hand, my int32 to string code, also here at SO, easily beat, by almost 3X, Voight's coding contest results using lookup tables. My general RX is to favor the most register-intensive, simple, brute force method possible, as it may well be the fastest, and then tweak from there if you still need more. BTW, thanks for your contributions here. –  Feb 13 '15 at 19:26
16

Fastest way would probably to store the reverse of all possible byte values in a look-up table. The table would take only 256 bytes.

Maurits Rijk
  • 9,789
  • 2
  • 36
  • 53
  • 1
    ...then just change the loop that writes the bytes out to the IO port to start at the end of the buffer and work backwards. – caf Jan 28 '10 at 11:54
14

Build a 256 element lookup table of byte values that are bit-reversed from their index.

{0x00, 0x80, 0x40, 0xc0, etc}

Then iterate through your array copying using each byte as an index into your lookup table.

If you are writing assembly language, the x86 instruction set has an XLAT instruction that does just this sort of lookup. Although it may not actually be faster than C code on modern processors.

You can do this in place if you iterate from both ends towards the middle. Because of cache effects, you may find it's faster to swap in 16 byte chunks (assuming a 16 byte cache line).

Here's the basic code (not including the cache line optimization)

// bit reversing lookup table
typedef unsigned char BYTE;
extern const BYTE g_RevBits[256];

void ReverseBitsInPlace(BYTE * pb, int cb)
{
    int iter = cb/2;
    for (int ii = 0, jj = cb-1; ii < iter; ++ii, --jj)
    {
        BYTE b1 = g_RevBits[pb[ii]];
        pb[ii] = g_RevBits[pb[jj]];
        pb[jj] = b1;
    }

    if (cb & 1) // if the number of bytes was odd, swap the middle one in place
    {
       pb[cb/2] = g_RevBits[pb[cb/2]];
    }
}

// initialize the bit reversing lookup table using macros to make it less typing.
#define BITLINE(n) \
   0x0##n, 0x8##n, 0x4##n, 0xC##n, 0x2##n, 0xA##n, 0x6##n, 0xE##n,\
   0x1##n, 0x9##n, 0x5##n, 0xD##n, 0x3##n, 0xB##n, 0x7##n, 0xF##n,

const BYTE g_RevBits[256] = {
  BITLINE(0), BITLINE(8), BITLINE(4), BITLINE(C), 
  BITLINE(2), BITLINE(A), BITLINE(6), BITLINE(E), 
  BITLINE(1), BITLINE(9), BITLINE(5), BITLINE(D), 
  BITLINE(3), BITLINE(B), BITLINE(7), BITLINE(F), 
  };
John Knoeller
  • 33,512
  • 4
  • 61
  • 92
9

The Bit Twiddling Hacks site is alwas a good starting point for these kind of problems. Take a look here for fast bit reversal. Then its up to you to apply it to each byte/word of your memory block.

EDIT:

Inspired by Dietrich Epps answer and looking at the ARM instruction set, there is a RBIT opcode that reverses the bits contained in a register. So if performance is critical, you might consider using some assembly code.

Frank Bollack
  • 24,478
  • 5
  • 49
  • 58
  • The URL for Bit Twiddling hacks is broken - it should be: http://graphics.stanford.edu/~seander/bithacks.html – Paul R Jan 28 '10 at 10:27
3

enter image description here

It looks like this code takes about 50 clocks per bit swap on my i7 XPS 8500 machine. 7.6 seconds for a million array flips. Single threaded. It prints some ASCI art based on patterns of 1s and 0s. I rotated the pic left 180 degrees after reversing the bit array, using a graphic editor, and they look identical to me. A double-reversed image comes out the same as the original.

As for pluses, it's a complete solution. It swaps bits from the back of a bit array to the front, vs operating on ints/bytes and then needing to swap ints/bytes in an array.

Also, this is a general purpose bit library, so you might find it handy in the future for solving other, more mundane problems.

Is it as fast as the accepted answer? I think it's close, but without working code to benchmark it's impossible to say. Feel free to cut and paste this working program.

// Reverse BitsInBuff.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include "time.h"
#include "memory.h"
//
//  Manifest constants
#define  uchar unsigned char
#define  BUFF_BYTES 510 //400 supports a display of 80x40 bits
#define  DW 80  //  Display Width
// ----------------------------------------------------------------------------
uchar   mask_set[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
uchar   mask_clr[] = { 0xfe, 0xfd, 0xfb, 0xf7, 0xef, 0xdf, 0xbf, 0x7f };
//
// Function Prototypes
static void PrintIntBits(long x, int bits);
void    BitSet(uchar * BitArray, unsigned long BitNumber);
void    BitClr(uchar * BitArray, unsigned long BitNumber);
void    BitTog(uchar * BitArray, unsigned long BitNumber);
uchar   BitGet(uchar * BitArray, unsigned long BitNumber);
void    BitPut(uchar * BitArray, unsigned long BitNumber, uchar value);
//
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt);
static void PrintIntBits(long x, int bits);
// -----------------------------------------------------------------------------
// Reverse the bit ordering in an array
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt)  {
    unsigned long front=0, back = BitKnt-1;
    uchar temp;
    while( front<back ) {
        temp = BitGet(Buff, front);                 // copy front bit to temp before overwriting 
        BitPut(Buff, front, BitGet(Buff, back));    // copy back bit to front bit
        BitPut(Buff, back, temp);                   // copy saved value of front in temp to back of bit arra)
        front++;
        back--;
    }
    return Buff;
}
//  ---------------------------------------------------------------------------
//  ---------------------------------------------------------------------------
int _tmain(int argc, _TCHAR* argv[])    {
    int i, j, k, LoopKnt = 1000001;
    time_t start;
    uchar Buff[BUFF_BYTES];
    memset(Buff, 0, sizeof(Buff));
    //  make an ASCII art picture
    for(i=0, k=0; i<(sizeof(Buff)*8)/DW; i++)   {
        for(j=0; j<DW/2; j++)   {
            BitSet(Buff, (i*DW)+j+k);
        }
        k++;
    }
    //  print ASCII art picture
    for(i=0; i<sizeof(Buff); i++)   {
        if(!(i % 10)) printf("\n"); // print bits in blocks of 80
        PrintIntBits(Buff[i], 8);
    }
    i=LoopKnt;
    start = clock();
    while( i-- )    {
        ReverseBitsInArray((uchar *)Buff, BUFF_BYTES * 8);
    }
    //  print ASCII art pic flipped upside-down and rotated left
    printf("\nMilliseconds elapsed = %d", clock() - start);
    for(i=0; i<sizeof(Buff); i++)   {
        if(!(i % 10)) printf("\n"); // print bits in blocks of 80
        PrintIntBits(Buff[i], 8);
    }
    printf("\n\nBenchmark time for %d loops\n", LoopKnt);
    getchar();
    return 0;
}
// -----------------------------------------------------------------------------
//  Scaffolding...
static void PrintIntBits(long x, int bits)  {
    unsigned long long z=1;
    int i=0;
    z = z << (bits-1);
    for (; z > 0; z >>= 1)  {
        printf("%s", ((x & z) == z) ? "#" : ".");
    }
}
//  These routines do bit manipulations on a bit array of unsigned chars
//  ---------------------------------------------------------------------------
void BitSet(uchar *buff, unsigned long BitNumber)   {
    buff[BitNumber >> 3] |= mask_set[BitNumber & 7];
}
//  ---------------------------------------------------------------------------- 
void BitClr(uchar *buff, unsigned long BitNumber)   {
    buff[BitNumber >> 3] &= mask_clr[BitNumber & 7];
}
//  ---------------------------------------------------------------------------- 
void BitTog(uchar *buff, unsigned long BitNumber)   {
    buff[BitNumber >> 3] ^= mask_set[BitNumber & 7];
}
//  ---------------------------------------------------------------------------- 
uchar BitGet(uchar *buff, unsigned long BitNumber)  {
    return (uchar)  ((buff[BitNumber >> 3] >> (BitNumber & 7)) & 1);
}
//  ---------------------------------------------------------------------------- 
void BitPut(uchar *buff, unsigned long BitNumber, uchar value)  {
    if(value)   {   // if the bit at buff[BitNumber] is true.
        BitSet(buff, BitNumber);
    }   else    {
        BitClr(buff, BitNumber);
    }
}

Below is the code listing for an optimization using a new buffer, instead of swapping bytes in place. Given that only 2030:4080 BitSet()s are needed because of the if() test, and about half the GetBit()s and PutBits() are eliminated by eliminating TEMP, I suspect memory access time is a large, fixed cost to these kinds of operations, providing a hard limit to optimization.

Using a look-up approach, and CONDITIONALLY swapping bytes, rather than bits, reduces by a factor of 8 the number of memory accesses, and testing for a 0 byte gets amortized across 8 bits, rather than 1.

Using these two approaches together, testing to see if the entire 8-bit char is 0 before doing ANYTHING, including the table lookup, and the write, is likely going to be the fastest possible approach, but would require an extra 512 bytes for the new, destination bit array, and 256 bytes for the lookup table. The performance payoff might be quite dramatic though.

// -----------------------------------------------------------------------------
// Reverse the bit ordering in new array
uchar *ReverseBitsInNewArray(uchar *Dst, const uchar *Src, const int BitKnt)    {
    int front=0, back = BitKnt-1;
    memset(Dst, 0, BitKnt/BitsInByte);

    while( front < back )   {
        if(BitGet(Src, back--)) {   // memset() has already set all bits in Dst to 0, 
            BitSet(Dst, front);     // so only reset if Src bit is 1
        }
        front++;
    }
    return Dst;
Hovercraft Full Of Eels
  • 283,665
  • 25
  • 256
  • 373
  • Faster than swapping in-place, would be to SET a bit, in reverse order, in a 2nd, memset( 0 ) bit array - CONDITIONAL on the source array's value being a 1. This would eliminate the need for setting the value of TEMP, and, on average, save all the 0 BitPut()s to the destination array. Given the lookup approach, which does have some appeal, burns 256 bytes, and the OPs 4096 bit array only 512, burning 256 more bytes(or whatever the display buffer's size is) seems like a better, more direct approach, with 2-3X my existing code's performance. –  May 14 '13 at 22:43
  • 1
    Somewhat disappointed, but the above CONDITIONAL setting of destination bits, depending on whether the source array's bit was true, runs in 60% of the time. However, for a lookup table and uchar swap, I believe this would be a great optimization, and will probably produce the fastest routine, as the cost of the if() test would be amortized across 8 bits, and 8 consecutive zero bits is probably not that uncommon, whereas 32 consecutive zero bits might be rare. –  May 15 '13 at 09:51
3

Loop through the half of the array, convert and exchange bytes.

for( int i = 0; i < arraySize / 2; i++ ) {
    char inverted1 = invert( array[i] );
    char inverted2 = invert( array[arraySize - i - 1] );
    array[i] = inverted2;
    array[arraySize - i - 1] = inverted1;
}

For conversion use a precomputed table - an array of 2CHAR_BIT (CHAR_BIT will most likely be 8) elements where at position "I" the result of byte with value "I" inversion is stored. This will be very fast - one pass - and consume only 2CHAR_BIT for the table.

sharptooth
  • 167,383
  • 100
  • 513
  • 979
2

To reverse a single byte x you can handle the bits one at a time:

unsigned char a = 0;
for (i = 0; i < 8; ++i) {
   a += (unsigned char)(((x >> i) & 1) << (7 - i));
}

You can create a cache of these results in an array so that you can quickly reverse a byte just by making a single lookup instead of looping.

Then you just have to reverse the byte array, and when you write the data apply the above mapping. Reversing a byte array is a well documented problem, e.g. here.

Community
  • 1
  • 1
Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
1

The data is pushed from system memory to the LCD driver over 8-bit IO port.

Since you'll be writing to the LCD one byte at a time, I think the best idea is to perform the bit reversal right when sending the data to the LCD driver rather than as a separate pre-pass. Something along those lines should be faster than any of the other answers:

void send_to_LCD(uint8_t* data, int len, bool rotate) {
  if (rotate)
    for (int i=len-1; i>=0; i--)
      write(reverse(data[i]));
  else
    for (int i=0; i<len; i++)
      write(data[i]);
}

Where write() is the function that sends a byte to the LCD driver and reverse() one of the single-byte bit reversal methods described in the other answers.

This approach avoids the need to store two copies of the video data in ram and also avoids the read-invert-write roundtrip. Also note that this is the simplest implementation: it could be trivially adapted to load, say, 4 bytes at a time from memory if this were to yield better performance. A smart vectorizing compiler may be even able to do it for you.

CAFxX
  • 28,060
  • 6
  • 41
  • 66
1

Single Core?

How much memory?

Is the display buffered in memory and pushed to the device, or is the only copy of the pixels in the screens memory?

Spence
  • 28,526
  • 15
  • 68
  • 103
  • Single core, ARM 9 series. 64MB, (to be scaled down to 32MB later), Linux. The memory is pushed from system memory (which I control) to the LCD driver memory (outside of my control). – SF. Jan 28 '10 at 10:34