
Is there a general strategy for creating an efficient bit permutation algorithm? The goal is a fast, branchless and, if possible, LUT-less algorithm. I'll give an example:

A 13-bit code is to be transformed into another 13-bit code according to the following rule table:

BIT  INPUT (DEC)  INPUT (BIN)    OUTPUT (BIN)   OUTPUT (DEC)
 0        1       0000000000001  0000100000000   256
 1        2       0000000000010  0010000000000  1024
 2        4       0000000000100  0100000000000  2048
 3        8       0000000001000  1000000000000  4096
 4       16       0000000010000  0000001000000    64
 5       32       0000000100000  0000000100000    32
 6       64       0000001000000  0001000000000   512
 7      128       0000010000000  0000000010000    16
 8      256       0000100000000  0000000001000     8
 9      512       0001000000000  0000000000010     2
10     1024       0010000000000  0000000000001     1
11     2048       0100000000000  0000000000100     4
12     4096       1000000000000  0000010000000   128

Example: If the input code is 1+2+4096=4099 the resulting output would be 256+1024+128=1408

A naive approach would be:

OUTPUT = ((INPUT AND 0000000000001) << 8) OR ((INPUT AND 0000000000010) << 9) OR ((INPUT AND 0000000000100) << 9) OR ((INPUT AND 0000000001000) << 9) OR ...

That means 3 instructions per bit (AND, SHIFT, OR), i.e. 39-1 = 38 instructions for the above example (the last OR is omitted). Instead we could also use a combination of left and right shifts to potentially reduce code size (depending on the target platform), but this will not decrease the number of instructions.
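For concreteness, here is a minimal C sketch of that naive approach, written with the mix of left and right shifts just mentioned (the function name and the use of uint16_t are illustrative choices; the masks and shift distances follow the rule table above):

#include <stdint.h>

/* Naive approach: one AND, one SHIFT and one OR per input bit.
   Masks and shift distances are taken from the rule table above. */
uint16_t permute_naive(uint16_t in) {
    return (uint16_t)(
           ((in & 0x0001) << 8)    /* bit 0  -> bit 8  */
         | ((in & 0x0002) << 9)    /* bit 1  -> bit 10 */
         | ((in & 0x0004) << 9)    /* bit 2  -> bit 11 */
         | ((in & 0x0008) << 9)    /* bit 3  -> bit 12 */
         | ((in & 0x0010) << 2)    /* bit 4  -> bit 6  */
         |  (in & 0x0020)          /* bit 5  -> bit 5  */
         | ((in & 0x0040) << 3)    /* bit 6  -> bit 9  */
         | ((in & 0x0080) >> 3)    /* bit 7  -> bit 4  */
         | ((in & 0x0100) >> 5)    /* bit 8  -> bit 3  */
         | ((in & 0x0200) >> 8)    /* bit 9  -> bit 1  */
         | ((in & 0x0400) >> 10)   /* bit 10 -> bit 0  */
         | ((in & 0x0800) >> 9)    /* bit 11 -> bit 2  */
         | ((in & 0x1000) >> 5));  /* bit 12 -> bit 7  */
}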

When inspecting the example table, you will of course notice a few obvious optimization opportunities, for example rows 2/3/4 (bits 1-3), which can be combined into ((INPUT AND 0000000001110) << 9). But beyond that it becomes a difficult, tedious task.

Are there general strategies? I think using Karnaugh-Veitch maps to simplify the expression could be one approach, but it is pretty difficult for 13 input variables. Also, the resulting expression would only be a combination of ANDs and ORs.

bkausbk
  • It looks a bit like: [Optimize lookup tables to simple ALU](https://stackoverflow.com/questions/72293060/optimize-lookup-tables-to-simple-alu/72296524#72296524) – Jérôme Richard Jun 20 '22 at 10:48
  • Am I correct that the output is a fixed permutation of bits of the input? – Rafał Dowgird Jun 20 '22 at 12:08
  • @RafałDowgird Yes, you are right, each bit in the input has a fixed bit in the output. Maybe a better title would be to ask for a bit permutation algorithm strategy? – bkausbk Jun 20 '22 at 12:12
  • You can balance both. First 10 bits from LUT (that is 4kB) and remaining 3 bits from computation (9 operations). 100%-cache-hit on L1 query and 9 bitwise operations should be doable in instruction-level-parallelism. – huseyin tugrul buyukisik Jun 20 '22 at 12:18
  • What exactly is the input/output here? Do I understand correctly the input is a translation table (like you have provided, but with redundancy removed) and the output should be an expression in C-style syntax? A string? A tree? – trincot Jun 20 '22 at 12:22
  • @trincot The full table would contain 2^13 entries; I only included the entries where exactly one of the input bits is set. The input can be any 13-bit number. – bkausbk Jun 20 '22 at 12:28
  • I understand that, but we would only need as input the logic that you have depicted in the table, right? I understand that in the final logic the input would be any 13 bit number, but it seems you are asking a *meta* question here: not about the function that will take a 13 bit number and will produce the output for it, but a function that will generate *that* function. So then the output you want in this question, is the function, based on the table you included. Is this correct? If so, can you specify the exact format of input/output? – trincot Jun 20 '22 at 12:32
  • @trincot Yes and no. Yes it could be seen as a meta question, but it is also specific to my example table given. More general question could be "to find a strategy for a bit permutation algorithm with n bits". – bkausbk Jun 20 '22 at 12:37
  • OK, but let's keep it to one question. Which of the two will it be? :) – trincot Jun 20 '22 at 12:38
  • @trincot If I say meta question it probably isn't the right place to ask here ;) So my question is to find a solution for my table given. – bkausbk Jun 20 '22 at 12:39
  • It surely would be the right place here, as the "meta" still is about programming. And in fact, it would be the more interesting question. But OK, for the given table, you would then be satisfied with an AND/OR expression that performs the translation? – trincot Jun 20 '22 at 12:41
  • @trincot If the equivalent AND/OR expression would be less instructions (faster) than a solution with branches then yes. – bkausbk Jun 20 '22 at 12:42
  • What do you have in mind with "branches"? – trincot Jun 20 '22 at 12:48
  • Is this it? `((input & 0b10000000000) >> 10) | ((input & 0b100000000000) >> 9) | ((input & 0b1000000000) >> 8) | ((input & 0b1000100000000) >> 5) | ((input & 0b10000000) >> 3) | (input & 0b100000) | ((input & 0b10000) << 2) | ((input & 0b1000000) << 3) | ((input & 0b1) << 8) | ((input & 0b1110) << 9)` – trincot Jun 20 '22 at 13:03
  • What I mean by branches is something like `if (input & 0b0000000000001) output |= 0b0000100000000;` and so on (13 if statements). And the naive approach without branches is basically the 39-1 instruction count solution, with the exception of bits 1, 2, 3 being combined in one expression. However, what I want is further optimization of the entire expression ... if this is even possible. The 10-bit LUT as mentioned by @huseyintugrulbuyukisik wastes far too much memory, unfortunately. – bkausbk Jun 20 '22 at 13:54
  • There would still be 29 instructions/operations which most likely will result in larger execution time than just using branches. – bkausbk Jun 20 '22 at 14:00
  • @bkausbk You can do with a 7-bit LUT and a 6-bit LUT, each containing 16-bit values. That's 384 bytes total, two lookups and an OR. – Rafał Dowgird Jun 20 '22 at 20:07

3 Answers


For bit permutations, several strategies are known that work in certain cases. There's a code generator at https://programming.sirrida.de/calcperm.php which implements most of them. In this case, however, it basically finds only the strategy you already suggested, which indicates that there is no easily exploitable pattern in this permutation.
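To illustrate the kind of building block such generators emit when a pattern does exist: much of the generated code is built from delta swaps, a primitive often called bit_permute_step (as on that site), which exchanges the bits selected by a mask with the bits a fixed distance above them. A minimal sketch of that primitive (not the generator's output for this particular table, which does not decompose nicely):

#include <stdint.h>

/* Delta swap: exchange each bit selected by m with the bit `shift`
   positions above it. Permutations that decompose into a handful of
   such steps (bit reversal, perfect shuffle, byte swap, ...) need
   only a few operations in total. */
static uint32_t bit_permute_step(uint32_t x, uint32_t m, unsigned shift) {
    uint32_t t = ((x >> shift) ^ x) & m;
    return x ^ t ^ (t << shift);
}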

Falk Hüffner
  • This was the information I was looking for. There is plenty of explanation about strategies for various concepts. – bkausbk Jun 21 '22 at 06:03

If one big lookup table is too much, you can try to use two smaller ones.

  1. Take 7 lower bits of the input, look up a 16-bit value in table A.

  2. Take 6 higher bits of the input, look up a 16-bit value in table B.

  3. OR the values from steps 1 and 2 to produce the result.

Table A needs 128*2 bytes, table B needs 64*2 bytes, that's 384 bytes for the lookup tables.
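A minimal sketch in C of this scheme, with the tables filled once at startup from the permutation given in the question (the names perm, tableA, tableB and init_tables are made up for illustration):

#include <stdint.h>

/* Output bit position for each input bit, taken from the question's
   rule table (index = input bit, value = output bit). */
static const uint8_t perm[13] = {8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7};

static uint16_t tableA[128];  /* covers input bits 0..6  */
static uint16_t tableB[64];   /* covers input bits 7..12 */

/* Fill both tables once, e.g. at startup. */
void init_tables(void) {
    for (int i = 0; i < 128; i++) {
        uint16_t out = 0;
        for (int b = 0; b < 7; b++)
            if (i & (1 << b)) out |= (uint16_t)(1u << perm[b]);
        tableA[i] = out;
    }
    for (int i = 0; i < 64; i++) {
        uint16_t out = 0;
        for (int b = 0; b < 6; b++)
            if (i & (1 << b)) out |= (uint16_t)(1u << perm[7 + b]);
        tableB[i] = out;
    }
}

/* At run time: two table lookups and one OR. */
uint16_t permute_lut(uint16_t in) {
    return tableA[in & 0x7F] | tableB[(in >> 7) & 0x3F];
}

Generating the tables from the perm array keeps them consistent with the rule table and avoids writing out 192 entries by hand.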

Rafał Dowgird

This is a hand-optimised multiple LUT solution, which doesn't really prove anything other than that I had some time to burn.

Multiple small lookup tables can occasionally save time and/or space, but I don't know of a strategy to find the optimal combination. In this case, the best division seems to be three LUTs of three bits each (bits 4-6, 7-9 and 10-12), totalling 24 bytes (each table has 8 one-byte entries), plus a simple shift to cover bits 1 through 3, and another simple shift for the remaining bit 0. Bit 5, which is untransformed, was also a tempting target, but I don't see a good way to divide bit ranges around it.

The three look-up tables have single-byte entries because the transformed bits for each group span no more than one byte. In fact, the transformations for two of the bit ranges fit entirely in the low-order byte, avoiding a shift.

Here's the code:

unsigned short permute_bits(unsigned short x) {
#define LUT3(BIT0, BIT1, BIT2)                                \
{ 0,      (BIT0),        (BIT1),        (BIT1)+(BIT0),        \
  (BIT2), (BIT2)+(BIT0), (BIT2)+(BIT1), (BIT2)+(BIT1)+(BIT0)}
    static const unsigned char t4[] =  LUT3(1<<(6-3), 1<<(5-3), 1<<(9-3));
    static const unsigned char t7[] =  LUT3(1<<4, 1<<3, 1<<1);
    static const unsigned char t10[] = LUT3(1<<0, 1<<2, 1<<7);
#undef LUT3
    return   (   (x&1)       << 8)    // Bit 0
           + (   (x&14)      << 9)    // Bits 1-3, simple shift
           + (t4[(x>>4)&7]   << 3)    // Bits 4-6, see below
           + (t7[(x>>7)&7]       )    // Bits 7-9, three-bit lookup for LOB
           + (t10[(x>>10)&7]     );   // Bits 10-12, ditto
}

Note on bits 4-6

Bit 6 is transformed to position 9, which is outside of the low-order byte. However, bits 4 and 5 are moved to positions 6 and 5, respectively, and the total range of the transformed bits is only 5 bit positions. Several different final shifts are possible, but keeping the shift relatively small provides a tiny improvement on the x86 architecture, because it allows the use of LEA to do a simultaneous shift and add. (See the second-to-last instruction in the assembly below.)

The intermediate results are added instead of combined with bitwise OR for the same reason. Since the sets of bits in each intermediate result are disjoint, ADD and OR produce the same result, and using ADD can take advantage of chip features like LEA.

Here's the compilation of that function, taken from http://gcc.godbolt.org using gcc 12.1 with -O3:

permute_bits(unsigned short):
        mov     edx, edi
        mov     ecx, edi
        movzx   eax, di
        shr     di, 4
        shr     dx, 7
        shr     cx, 10
        and     edi, 7
        and     edx, 7
        and     ecx, 7
        movzx   ecx, BYTE PTR permute_bits(unsigned short)::t10[rcx]
        movzx   edx, BYTE PTR permute_bits(unsigned short)::t7[rdx]
        add     edx, ecx
        mov     ecx, eax
        sal     eax, 9
        sal     ecx, 8
        and     ax, 7168
        and     cx, 256
        or      eax, ecx
        add     edx, eax
        movzx   eax, BYTE PTR permute_bits(unsigned short)::t4[rdi]
        lea     eax, [rdx+rax*8]
        ret

I left out the lookup tables themselves because the assembly produced by GCC isn't very helpful.

I don't know if this is any quicker than @trincot's solution (in a comment); a quick benchmark was inconclusive, but it looked to be a few percent quicker. But it's quite a bit shorter, possibly enough to compensate for the 24 bytes of lookup data.

rici
  • This seems to be a pretty good implementation even on my PIC32 MCU. If I counted right, there were only 19 instructions. – bkausbk Jun 21 '22 at 05:57