15

When you and some data with a mask you get some result which is of the same size as the data/mask. What I want to do, is to take the masked bits in the result (where there was 1 in the mask) and shift them to the right so they are next to each other and I can perform a CTZ (Count Trailing Zeroes) on them.

I didn't know how to name such a procedure so Google has failed me. The operation should preferably not be a loop solution, this has to be as fast operation as possible.

And here is an incredible image made in MS Paint. enter image description here

Tombart
  • 30,520
  • 16
  • 123
  • 136
cen
  • 2,873
  • 3
  • 31
  • 56

2 Answers2

17

This operation is known as compress right. It is implemented as part of BMI2 as the PEXT instruction, in Intel processors as of Haswell.

Unfortunately, without hardware support is it a quite annoying operation. Of course there is an obvious solution, just moving the bits one by one in a loop, here is the one given by Hackers Delight:

unsigned compress(unsigned x, unsigned m) {
   unsigned r, s, b;    // Result, shift, mask bit. 

   r = 0; 
   s = 0; 
   do {
      b = m & 1; 
      r = r | ((x & b) << s); 
      s = s + b; 
      x = x >> 1; 
      m = m >> 1; 
   } while (m != 0); 
   return r; 
} 

But there is an other way, also given by Hackers Delight, which does less looping (number of iteration logarithmic in the number of bits) but more per iteration:

unsigned compress(unsigned x, unsigned m) {
   unsigned mk, mp, mv, t; 
   int i; 

   x = x & m;           // Clear irrelevant bits. 
   mk = ~m << 1;        // We will count 0's to right. 

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);             // Parallel prefix. 
      mp = mp ^ (mp << 2); 
      mp = mp ^ (mp << 4); 
      mp = mp ^ (mp << 8); 
      mp = mp ^ (mp << 16); 
      mv = mp & m;                     // Bits to move. 
      m = m ^ mv | (mv >> (1 << i));   // Compress m. 
      t = x & mv; 
      x = x ^ t | (t >> (1 << i));     // Compress x. 
      mk = mk & ~mp; 
   } 
   return x; 
}

Notice that a lot of the values there depend only on m. Since you only have 512 different masks, you could precompute those and simplify the code to something like this (not tested)

unsigned compress(unsigned x, int maskindex) {
   unsigned t; 
   int i; 

   x = x & masks[maskindex][0];

   for (i = 0; i < 5; i++) {
      t = x & masks[maskindex][i + 1]; 
      x = x ^ t | (t >> (1 << i));
   } 
   return x; 
}

Of course all of these can be turned into "not a loop" by unrolling, the second and third ways are probably more suitable for that. That's a bit of cheat however.

harold
  • 61,398
  • 6
  • 86
  • 164
  • It turns out that my problem demands bits being compressed from higher to lower half the time. So, higher bit gets packet on lowest slot and down from there. Is there a version of this algorithm that can do that? Reversing the data and the mask would achieve that but I think that is too costly operation. – cen Feb 02 '15 at 20:02
  • @cen you can do that extract/flip with a butterfly network and a mask, explained somewhere else on that first link I gave in the answer. Perhaps it is easier to just reverse+extract though, reversing isn't too disastrous, for example see http://stackoverflow.com/a/9144870/555045 – harold Feb 02 '15 at 20:30
  • if the result is to be reversed then probably my proposed way is better without hardware support. [Bit twiddling hacks](http://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith64Bits) also use multiplication like this to reverse bits in a byte – phuclv Feb 03 '15 at 02:38
2

You can use the pack-by-multiplication technique similar to the one described here. This way you don't need any loop and can mix the bits in any order.

For example with the mask 0b10101001 == 0xA9 like above and 8-bit data abcdefgh (with a-h is the 8 bits) you can use the below expression to get 0000aceh

uint8_t compress_maskA9(uint8_t x)
{
    const uint8_t mask1 = 0xA9 & 0xF0;
    const uint8_t mask2 = 0xA9 & 0x0F;
    return (((x & mask1)*0x03000000 >> 28) & 0x0C) | ((x & mask2)*0x50000000 >> 30);
}

In this specific case there are some overlaps of the 4 bits while adding (which incur unexpected carry) during the multiplication step, so I've split them into 2 parts, the first one extracts bit a and c, then e and h will be extracted in the latter part. There are other ways to split the bits as well, like a & h then c & e. You can see the results compared to Harold's function live on ideone

An alternate way with only one multiplication

const uint32_t X = (x << 8) | x;
return (X & 0x8821)*0x12050000 >> 28;

I got this by duplicating the bits so that they're spaced out farther, leaving enough space to avoid the carry. This is often better than splitting into 2 multiplications


If you want the result's bits reversed (i.e. heca0000) you can easily change the magic numbers accordingly

// result: he00 | 00ca;
return (((x & 0x09)*0x88000000 >> 28) & 0x0C) | (((x & 0xA0)*0x04800000) >> 30);

or you can also extract the 3 bits e, c and a at the same time, leaving h separately (as I mentioned above, there are often multiple solutions) and you need only one multiplication

return ((x & 0xA8)*0x12400000 >> 29) | (x & 0x01) << 3; // result: 0eca | h000

But there might be a better alternative like the above second snippet

const uint32_t X = (x << 8) | x;
return (X & 0x2881)*0x80290000 >> 28

Correctness check: http://ideone.com/PYUkty

For a larger number of masks you can precompute the magic numbers correspond to those masks and store them in an array so that you can look them up immediately for use. I calculated those mask by hand but you can do that automatically


Explanation

We have abcdefgh & mask1 = a0c00000. Multiply it with magic1

    ........................a0c00000
 ×  00000011000000000000000000000000 (magic1 = 0x03000000)
    ────────────────────────────────
    a0c00000........................
 + a0c00000......................... (the leading "a" bit is outside int's range
    ────────────────────────────────  so it'll be truncated)
r1 = acc.............................

=> (r1 >> 28) & 0x0C = 0000ac00

Similarly we multiply abcdefgh & mask2 = 0000e00h with magic2

  ........................0000e00h
× 01010000000000000000000000000000 (magic2 = 0x50000000)
  ────────────────────────────────
  e00h............................
+ 0h..............................
  ────────────────────────────────
r2 = eh..............................

=> (r2 >> 30) = 000000eh

Combine them together we have the expected result

((r1 >> 28) & 0x0C) | (r2 >> 30) = 0000aceh

And here's the demo for the second snippet

                  abcdefghabcdefgh
&                 1000100000100001 (0x8821)
  ────────────────────────────────
                  a000e00000c0000h
× 00010010000001010000000000000000 (0x12050000)
  ────────────────────────────────
  000h
  00e00000c0000h
+ 0c0000h
  a000e00000c0000h
  ────────────────────────────────
= acehe0h0c0c00h0h
& 11110000000000000000000000000000
  ────────────────────────────────
= aceh

For the reversed order case:

                  abcdefghabcdefgh
&                 0010100010000001 (0x2881)
  ────────────────────────────────
                  00c0e000a000000h
x 10000000001010010000000000000000 (0x80290000)
  ────────────────────────────────
  000a000000h
  00c0e000a000000h
+ 0e000a000000h
  h
  ────────────────────────────────
  hecaea00a0h0h00h
& 11110000000000000000000000000000
  ────────────────────────────────
= heca

Related:

phuclv
  • 37,963
  • 15
  • 156
  • 475
  • How do you calculate the multipliers? – harold Feb 02 '15 at 18:24
  • @harold I just write the binary multiplication out and turn on the bit in the magic number where I want to shift the multiplicand to. For example to shift the least significant bit to bit 28 I turn bit 28 to 1. Here is a correction check with your function http://ideone.com/bud3dI – phuclv Feb 02 '15 at 18:30
  • I explained this technique in some other questions (http://stackoverflow.com/a/23875535/995714 and http://stackoverflow.com/a/26201755/995714) which may be easier to understand – phuclv Feb 02 '15 at 18:36