1

I have an array of char (usually thousands of bytes long) read from a file, all composed of 0 and 1 (not '0' and '1', in which case I could use strtoul). I want to pack these into single bits, thus converting each 32 char into a single uint32_t. Should I write a bit shift operation with 32 parts, or is there a saner way?

out[i/32] = 
    data[i] << 31 |
    data[i+1] << 30 |
    data[i+2] << 29 |
    data[i+3] << 28 |
    data[i+4] << 27 |
    data[i+5] << 26 |
    data[i+6] << 25 |
    data[i+7] << 24 |
    data[i+8] << 23 |
    data[i+9] << 22 |
    data[i+10] << 21 |
    data[i+11] << 20 |
    data[i+12] << 19 |
    data[i+13] << 18 |
    data[i+14] << 17 |
    data[i+15] << 16 |
    data[i+16] << 15 |
    data[i+17] << 14 |
    data[i+18] << 13 |
    data[i+19] << 12 |
    data[i+20] << 11 |
    data[i+21] << 10 |
    data[i+22] << 9 |
    data[i+23] << 8 |
    data[i+24] << 7 |
    data[i+25] << 6 |
    data[i+26] << 5 |
    data[i+27] << 4 |
    data[i+28] << 3 |
    data[i+29] << 2 |
    data[i+30] << 1 |
    data[i+31];

If this monstrous bit shift is the fastest in run time, then I'll have to stick to it.

Rodrigo
  • 4,706
  • 6
  • 51
  • 94
  • 1
    Yep, bitwise `or` and shift as you iterate over each byte is about all you can do in this case I think. – Collin Dec 03 '18 at 02:44
  • The entire file is only 32 bytes long? Or you actually want to do this to a long stream of bytes? – BeeOnRope Dec 03 '18 at 03:08
  • @BeeOnRope It depends on the size of the raster. I'm testing it on a 40,000 char long (equivalent to a row in the raster). – Rodrigo Dec 03 '18 at 03:10
  • [How to create a byte out of 8 bool values (and vice versa)?](https://stackoverflow.com/q/8461126/995714), [What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?](https://stackoverflow.com/q/26200961/995714), [How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD](https://stackoverflow.com/q/52098873/995714) – phuclv Dec 03 '18 at 03:10
  • Possible duplicate of [What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?](https://stackoverflow.com/questions/26200961/whats-the-fastest-way-to-pack-32-0-1-values-into-the-bits-of-a-single-32-bit-va) – phuclv Dec 03 '18 at 03:10
  • @Rodrigo - then you should ask that question, because the best approach to pack 40,000 charts might be different than the best approach to pack 32, since many things can be hoisted out of the loop. – BeeOnRope Dec 03 '18 at 03:11
  • @BeeOnRope Thank you, I edited the first sentence to emphasize that. – Rodrigo Dec 03 '18 at 03:13
  • [C/C++ packing signed char into int](https://stackoverflow.com/q/2437283/995714) – phuclv Dec 03 '18 at 03:30

3 Answers3

2

Restricted to the x86 platform, you can use the PEXT instruction. It is part of the BMI2 instruction set extension on newer processors.

Use 32-bit instructions in a row and then merge the results in one value with shifts.

This is probably the optimal approach on Intel processors, but the disadvantage is that this instruction is slow on AMD Ryzen.

zx485
  • 28,498
  • 28
  • 50
  • 59
2

If you don't need the output bits to appear in exactly the same order as the input bytes, but if they can instead be "interleaved" in a specific way, then a fast and portable way to accomplish this is to take 8 blocks of 8 bytes (64 bytes total) and to combine all the LSBs together into a single 8 byte value.

Something like:

uint32_t extract_lsbs2(uint8_t (&input)[32]) {
  uint32_t t0, t1, t2, t3, t4, t5, t6, t7;
  memcpy(&t0, input + 0 * 4, 4);
  memcpy(&t1, input + 1 * 4, 4);
  memcpy(&t2, input + 2 * 4, 4);
  memcpy(&t3, input + 3 * 4, 4);
  memcpy(&t4, input + 4 * 4, 4);
  memcpy(&t5, input + 5 * 4, 4);
  memcpy(&t6, input + 6 * 4, 4);
  memcpy(&t7, input + 7 * 4, 4);

  return 
    (t0 << 0) |
    (t1 << 1) |
    (t2 << 2) |
    (t3 << 3) |
    (t4 << 4) |
    (t5 << 5) |
    (t6 << 6) |
    (t7 << 7);
}

This generates "not terrible, not great" code on most compilers.

If you use uint64_t instead of uint32_t it would generally be twice as fast (assuming you have more than 32 total bytes to transform) on a 64-bit platform.

With SIMD you could easy vectorize the entire operation in something like two instructions (for AVX2, but any x86 SIMD ISA will work): compare and pmovmskb.

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • I can imagine a sequence of `pand`, `pmaddubsw`, `phaddw`, `phaddd` that would be compatible with more machines than AVX2 and likely to be quite fast... – Iwillnotexist Idonotexist Dec 03 '18 at 03:27
  • @IwillnotexistIdonotexist - I wasn't thinking of any special instructions in AVX2, just that it happens that 32 bytes the OP suggested needs only 1 AVX instruction. You could do it with SSE, just at half the speed. To be clear I am thinking of a `cmpeq; movmskb` sequence, with a scalar store. This should achieve close to 64 bits of output a cycle (combing two 32-bit regs into a 64-bit one to save store bandwidth), which means close to 64 bytes of input per cycle. It will hard to beat that I think with a fully-SIMD solution? – BeeOnRope Dec 03 '18 at 03:46
1

Bit shifting is the simplest way to go about this. Better to write code that reflects what you're actually doing rather than trying to micro-optimize.

So you want something like this:

char bits[32];
// populate bits
uint32_t value = 0;
for (int i=0; i<32; i++) {
    value |= (uint32_t)(bits[i] & 1) << i;
}
dbush
  • 205,898
  • 23
  • 218
  • 273
  • Thank you. Please take a look at my edit. Isn't a loop less efficient than a single operation? Or both are the same to the processor? – Rodrigo Dec 03 '18 at 03:03
  • 2
    @Rodrigo Many compilers will perform loop unrolling as an optimization. Don't worry about that level of manual optimization unless you've identified a specific bottleneck. – dbush Dec 03 '18 at 03:12