Performing bit level permutations on a quadword

Question

I'm looking for the fastest possible way to permutate bits in a 64 bit integer.

Given a table called "array" corresponding to a permutations array, meaning it has a size of 64 and filled with unique numbers (i.e. no repetition) ranging from 0 to 63, corresponding to bit positions in a 64 bit integer, I can permutate bits this way

bit = GetBitAtPos(integer_, array[i]);
SetBitAtPos(integer_, array[i], GetBitAtPos(integer_, i));
SetBitAtPos(integer_, i, bit);

(by looping i from 0 to 63)

GetBitAtPos being
GetBitAtPos(integer_, pos) { return (integer >>) pos & 1 }

Setbitatpos is also founded on the same principle (i.e. using C operators), under the form SetBitAtPos(integer, position, bool_bit_value)

I was looking for a faster way, if possible, to perform this task. I'm open to any solution, including inline assembly if necessary. I have difficulty to figure a better way than this, so I thought I'd ask.

I'd like to perform such a task to hide data in a 64 bit generated integer (where the 4 first bit can reveal informations). It's a bit better than say a XOR mask imo (unless I miss something), mostly if someone tries to find a correlation. It also permits to do the inverse operation to not lose the precious bits...

However I find the operation to be a bit costly...

Thanks

Does it have to take a permutation table as the input, or could it, say, take steering bits for a benes network as input? (or better yet, can the permutation be constant?) — harold, Feb 20 '13 at 01:08
would loading/storing a precomputed table be faster than computing it? — גלעד ברקן, Feb 20 '13 at 01:37
Perhaps http://graphics.stanford.edu/~seander/bithacks.html is relevant here. — vonbrand, Feb 20 '13 at 03:54
Thanks for your comments. The permutation array is constant, yes. Say I have an array like this (for a 4 bit int), I'm looking for a way to do this : the permutation table would be { 3, 2, 0, 1 }, then I would switch the bit at pos 0 with the bit at pos 3, the bit at pos 1 with the one at pos 2, etc etc. And yes, the said permutation table would be kept constant (so I would know the for example 1st bit of the *initial* integer (before the permutations operation) would now be at 10th position etc etc). If you have better algorithms I'm also open. @MichaelDorgan : It would be for X86/X64 intel. — Yannick, Feb 20 '13 at 05:23
A better method of obfuscation would be to multiply with an odd constant, then multiply by its inverse modulo 2^64 to get back the original value: https://en.wikipedia.org/wiki/Modular_multiplicative_inverse — starblue, Feb 20 '13 at 07:37

harold · Answer 1 · 2016-02-03T08:04:26.820

Since the permutation is constant, you should be able to come up with a better way than moving the bits one by one (if you're OK with publishing your secret permutation, I can have a go at it). The simplest improvement is moving bits that have the same distance (that can be a modular distance because you can use rotates) between them in the input and output at the same time. This is a very good methods if there are few such groups.

If that didn't work out as well as you'd hoped, see if you can use bit_permute_steps to move all or most of the bits. See the rest of that site for more ideas.

If you can use PDEP and PEXT, you can move bits in groups where the distance between bits can arbitrarily change (but their order can not). It is, afaik, unknown how fast they will be though (and they're not available yet).

The best method is probably going to be a combination of these and other tricks mentioned in other answers.

There are too many possibilities to explore them all, really, so you're probably not going to find the best way to do the permutation, but using these ideas (and the others that were posted) you can doubtlessly find a better what than you're currently using.

PDEP and PEXT have been available for a while now so their performance is known, at 3 cycle latency and 1/cycle throughput they're faster than most other useful permutation primitives (except trivial ones).

score 0 · Answer 2 · edited May 23 '17 at 12:15

0

Split your bits into subsets where this method works:

Extracting bits with a single multiplication

Then combine the results using bitwise OR.

edited May 23 '17 at 12:15

Community

1
1

answered Feb 20 '13 at 06:39

starblue

55,348
14
97
151

Multiplication is only useful when one wants to copy a bit[group] to multiple locations. – Aki Suihkonen Feb 20 '13 at 10:46
@Aki Suihkonen If you do it carefully all but one of the multiple locations will be lost due to overflow (see the linked question). Or you could mask out unwanted bits. – starblue Feb 20 '13 at 14:02

Aki Suihkonen · Answer 3 · 2013-02-20T07:17:45.320

For 64-bit number I believe the problem (of finding best algorithm) may be unsolvable due to huge amount of possibilities. One of the most scalable and easiest to automatize would be look up table:

result = LUT0[ value & 0xff] +  
         LUT1[(value >> 8) & 0xff] +  
         LUT2[(value >> 16) & 0xff] + ...  
     +   LUT7[(value >> 56) & 0xff];

Each LUT entry must be 64-bit wide and it just spreads each 8 bits in a subgroup to the full range of 64 possible bins. This configuration uses 16k of memory.

The scalability comes from the fact that one can use any number of look up tables (practical range from 3 to 32?). This method is vulnerable to cache misses and it can't be parallelized (for large table sizes at least).

If there are certain symmetries, there are some clever trick available -- e.g. swapping two bits in Intel:

 test eax, (1<<BIT0 | 1<<BIT1)
 jpe skip:
 xor  eax, (1<<BIT0 | 1<<BIT1)
 skip:

This OTOH is highly vulnerable to branch mispredictions.

Performing bit level permutations on a quadword

3 Answers3