4

Let's say that data is 1011 1001 and the mask is 0111 0110, then you have:

input data:       1011 1001
input mask:       0111 0110
apply mask:       0011 0000 (based on `input mask`)
bits selected:    -011 -00- (based on `input mask`)
right packed:     ---0 1100
expected result:  0000 1100 (set left `8 - popcount('input mask')` bits to zero) 

So the final output is 0000 1100 (note that the 3 positions on the left which are unspecified are zero-filled).

You can see that wherever the bits in input mask is 1 the corresponding value in input data is selected (in bits selected above) and then all selected bits are packed contiguously started in the least significant bits of the result (as shown in right packed above). Finally, any leftmost bits which are left over after the packing are set to 0 (there will be 8 - popcount(mask) such bits).

Obvious choice is rotate and select but that will consume 5 operations as mask has 5 bits. Can I do this in one step?

Note:

  1. The mask can be anything with arbitrary n bits ON (In above example n=5). All you know is the number of bits that are ON in the mask and the mask itself. Mask will keep on changing with n bits ON.

  2. In above example I have used data and mask of 8-bits but in real usage it can be 8, 16, 32, 64 and 128 bits.

ead
  • 32,758
  • 6
  • 90
  • 153
Praveen Kulkarni
  • 2,816
  • 1
  • 23
  • 39
  • What do you mean "moved to right hand side?" What is the output if the mask is 0111 0111? – janovak Jan 12 '17 at 15:49
  • So other examples would be `1101 0010` with mask `0001 1111` -> `0001 0010` and data `1101 0010` mask `1111 1000` -> `0001 1010`? And since the mask has 5 `1` bits, the high 3 bits of the result will always be `0`? – jaggedSpire Jan 12 '17 at 15:50
  • It would be easier to understand what you're trying to do if you'd include the existing (5-operation) code as an example in your question. – Jeremy Friesner Jan 12 '17 at 16:05
  • In fact, the question was a bit clearer when looking at the original (unformatted) text, but I think the author wasn't familiar enough with markdown to have that carry though to the formatted question. I added the correct formatting and further clarified the question (in fact, this is a reasonably well-known bit manipulation algorithm, and the inverse of [this one](http://stackoverflow.com/questions/16568134/expand-right-bitwise-algorithm)). – BeeOnRope Jan 13 '17 at 03:01
  • Possible duplicate of [Shift masked bits to the lsb](https://stackoverflow.com/questions/28282869/shift-masked-bits-to-the-lsb) – phuclv Jan 23 '18 at 05:42

1 Answers1

8

If you're targeting x86 most compilers will have an instrinsic for the pdep (parallel bit deposit) instruction which directly performs the operation you want, in hardware, at a rate of 1 per cycle (3 cycles latency)1, on Intel hardware that supports it. For example, gcc offers it as the _pdep_u32 and _pdep_u64 intrinsic functions.

Unfortunately, on AMD Ryzen (the only AMD hardware that supports BMI2) this operation is very slow: one per 18 cycles. You might want to have a separate code-path to support non-Intel platforms if they are important to you.

If you aren't on x86, you can find general purpose implementations of these options here - the specific operation you want is expand_right - and this other section will probably be of great interest in that it specifically covers the simple case where you are dealing with word-sized elements.

In practice, if you are really dealing with 8-bit data and mask values, you might just use a precomputed lookup table - either a big 8 bit x 8 bit = 65k one which covers all {data, mask} combinations and which gives you the answer directly, or a 256-entry one which covers all mask values and gives you some coefficients for a simple bit-shifting calculation or a multiplication-based code.

FWIW, I'm not sure how you can do it easily with 5 rotate instructions, because it seems that the naive solution needs 1 rotate instruction for each bit, whether set or not (so for a word size of 8 bits, 7 or 8 rotate2 instructions).


1 Of course, the performance in principle depends on the hardware, but on all the mainstream Intel CPUs that implement it, it's 1 cycle throughput, 3 cycles latency (not sure about AMD).

2 Only 7 rotates because the "rotate of 0" operation for the lowest order bit can evidently be omitted.

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • 5 rotations referred to the example mask with five up bits. (And the author indicated that this was entirely arbitrary.) – n.caillou Jan 16 '17 at 20:50
  • @n.caillou - understood, but I assume the mask is variable, both in the number of 1 bits and in their position, so it isn't obvious to me how you write code to uses N rotations for an N bit mask. Of course, it's is possible, e.g., with "count leading zeroes" or "find first 1" type instructions, but then you should use `pdep` or equivalents and I don't think that's what the author was talking about. – BeeOnRope Jan 16 '17 at 20:52
  • 1
    @n.caillou - BTW, I interpreted the author's final statement (_Note: the mask is arbitrary with 5 bits on._) not as 5 on bits being arbitrary, but rather as "The mask will always have exactly 5 set bits, but which bits are set is arbitrary" - but that doesn't seem like a particularly easier problem to solve than the one where the mask is totally arbitrary. – BeeOnRope Jan 16 '17 at 21:05
  • @BeeOnRope the number of bits can be anything. I will update the question – Praveen Kulkarni Jan 28 '17 at 13:45
  • @pra - OK, makes sense. My answer generally applies to any number of on bits. – BeeOnRope Jan 28 '17 at 15:00