10

I'm working on an x86 or x86_64 machine. I have an array `unsigned int a[32]`, all of whose elements have value either 0 or 1. I want to set the single variable `unsigned int b` so that `((b >> i) & 1) == a[i]` holds for all 32 elements of `a`. I'm working with GCC on Linux (shouldn't matter much, I guess).

What's the fastest way to do this in C?

einpoklum
  • This kind of loop is easily optimized by a decent compiler (possibly unrolled, possibly inlined, etc.). What were your measurements? – quantdev Oct 05 '14 at 08:50
  • Since you mention specific architectures, is it OK to use things such as SSE? – harold Oct 05 '14 at 09:05
  • The fastest way is to skip storing them as `unsigned int` in the first place. – tenfour Oct 05 '14 at 09:48
  • @tenfour: The fastest way is just knowing what the output of the entire program needs to be, and just write that out... – einpoklum Oct 05 '14 at 10:23
  • @einpoklum: It's quite practical to abstract your `uint` array into something that packs into a bitfield right from the start. What you said is nonsense. – tenfour Oct 05 '14 at 10:26
  • @tenfour: Are you asserting that bitfield access is as fast as scalar-word access? Because otherwise you've pushed the cost of accessing the booleans up in other parts of OP's code. – Ira Baxter Oct 05 '14 at 10:42
  • @tenfour: I don't control the format in which I'm getting the array. – einpoklum Oct 05 '14 at 10:50
  • THIS QUESTION IS NOT A DUPLICATE. It asks about the fastest possible way to pack a big set of booleans. The "duplicate" question asks about the most convenient way to pack a small set. You get completely different answers. – Ira Baxter Oct 07 '14 at 11:09
  • @ShafikYaghmour: it's not a dup. what are you doing?? – Karoly Horvath Oct 07 '14 at 11:28

7 Answers

11

The fastest way on recent x86 processors is probably to make use of the MOVMSK family of instructions (PMOVMSKB and friends), which extract the MSB of each lane of a SIMD register and pack them into a normal integer register.

I fear SIMD intrinsics are not really my thing, but something along these lines ought to work if you've got an AVX2-equipped processor:

#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
    return _mm256_movemask_epi8(tmp);
}

Assuming sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
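Since the question's input is actually unsigned int a[32] rather than bool, here is a sketch of the same idea for 32-bit elements (my own adaptation, untested, not part of the original answer): compare each lane against zero and let MOVMSKPS collect one bit per lane, eight lanes per 256-bit load.

#include <immintrin.h>
#include <stdint.h>

uint32_t bitpack_u32(const unsigned int a[32]) {
    uint32_t b = 0;
    for (int i = 0; i < 4; ++i) {
        __m256i v   = _mm256_loadu_si256((const __m256i *)(a + 8 * i));
        __m256i cmp = _mm256_cmpgt_epi32(v, _mm256_setzero_si256());
        /* movemask_ps grabs the sign bit of each 32-bit lane: 8 bits per load */
        b |= (uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(cmp)) << (8 * i);
    }
    return b;
}

Here a[i] ends up in bit i of b, which is the order the question asks for.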

doynax
  • @harold: Yeah, I just spotted and attempted to fix that issue. Sorry – doynax Oct 05 '14 at 09:41
  • Oh btw it seems that OP has wider input elements, could fix that with a couple of packs.. or leave as exercise for the reader of course :) – harold Oct 05 '14 at 09:51
  • Do you think you could add the appropriate `#IFDEF`s which ensure the relevant opcodes are available? – einpoklum Oct 05 '14 at 10:02
  • @harold: Ugh, you're right. Three PACKUSWB operations or the like ought to do the trick but let's hope the OP can switch types and has an AVX2 system, otherwise the resulting code gets a whole lot messier – doynax Oct 05 '14 at 10:03
  • @einpoklum: Don't quote me on this but I believe you can check `__AVX2__` in GCC (assuming -march=native/core-avx2). Hopefully you don't need to distribute a single binary which dynamically invokes code for the best available instruction set – doynax Oct 05 '14 at 10:15
  • Another point... what about making the writes non-temporal? I don't need those values cached. – einpoklum Oct 08 '14 at 19:26
  • @einpoklum: Are you referring to the writes producing the integer array? In that case I would look into any options for shrinking the data first. Playing with the cache in that way is likely to backfire unless you're very careful – doynax Oct 09 '14 at 10:58
  • @doynax: Yes, the writes producing the packed integer array. I'm not sure what you mean about shrinking the data - packing boolean values into single bits is the shrinking I'm doing, isn't it? – einpoklum Oct 09 '14 at 11:09
  • @einpoklum: If only a 0 or 1 is ever stored and have sufficient control over the writer to use non-temporal stores then I figured you might be able to emit bytes instead. Or emit small blocks and bit-pack more frequently while the cache is still hot. Taking manual control over the cache is basically an optimization of last resort – doynax Oct 09 '14 at 12:39
  • I think instead of comparing with 0, you can also shift the lsb left to move it to the most significant bit to use with `_mm256_movemask_epi8` – phuclv Mar 06 '18 at 17:28
7

If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here, on a computer with fast multiplication, like this:

#include <stdbool.h>
#include <stdint.h>

inline int pack8b(bool* a)
{
    uint64_t t = *((uint64_t*)a);   /* requires the array to be 8-byte aligned */
    /* note: this puts a[0] in the most significant bit of the returned byte */
    return (0x8040201008040201ULL*t >> 56) & 0xFF;
}

int pack32b(bool* a)
{
    return (pack8b(a +  0) << 24) | (pack8b(a +  8) << 16) |
           (pack8b(a + 16) <<  8) | (pack8b(a + 24) <<  0);
}

Explanation:

Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and loading it, on a little-endian machine we get the bits in reversed order. Now we do a multiplication (here dots are zero bits):

  |  a7  ||  a6  ||  a5  ||  a4  ||  a3  ||  a2  ||  a1  ||  a0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
  ↑..d.....↑.c......↑b.......a
  ↑.c......↑b.......a
  ↑b.......a
  a       
  ────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The arrows are added so it's easier to see the positions of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we just need to shift it down and mask the remaining bits out.

So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001, i.e. 0x8040201008040201, we get the code above.
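As a quick sanity check (my own arithmetic, not part of the original answer): if only a[0] is 1 then the loaded word is t = 1, so the product is just the magic constant, and (0x8040201008040201 >> 56) & 0xFF = 0x80, i.e. a[0] lands in the most significant bit of the packed byte.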

Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it further, for example by shifting only once instead of shifting by 56 bits in each pack8b call.


Sorry, I overlooked the question: I saw doynax's bool array, misread "32 0/1 values", and thought they were 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distributions of bits) at the same time, but it's a lot less efficient than packing bytes.

On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with

_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);

And to pack 2 of the question's uint32_t values at a time, use

_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
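Putting that together for the question's full unsigned int a[32] might look like this (my own sketch, not from the original answer; requires BMI2, e.g. compile with -mbmi2, and uses memcpy to avoid the aliasing cast):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

uint32_t pack32_pext(const unsigned int a[32])
{
    uint32_t b = 0;
    for (int i = 0; i < 32; i += 2) {
        uint64_t t;
        memcpy(&t, a + i, sizeof t);  /* two 32-bit elements per 64-bit word */
        b |= (uint32_t)_pext_u64(t, (1ULL << 32) | 1ULL) << i;
    }
    return b;
}

This keeps a[i] in bit i of b, matching the question's required order.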
phuclv
  • Nobody said anything about `bool`s... but your suggestion can be adapted to unsigned ints, to some extent. – einpoklum Oct 05 '14 at 10:51
  • It is fast, but I don't get the expected result... Can you try with the set of numbers in Galik's answer? I have been using that as a benchmark. (I already tried with `LL` at the end of the long constant.) ((Assuming a 1-byte source array.)) – Jongware Oct 05 '14 at 10:53
  • This is a nice scheme if your processor does big multiplies fast. Your processor may not have that property. – Ira Baxter Oct 05 '14 at 10:59
  • oops I misread the bool in the original title. It can be adapted to unsigned ints but then you can only compact 2 bits in one step, and it will be much slower than other solutions – phuclv Oct 05 '14 at 11:54
  • @Jongware this is because the magic number above is for big endian. For little endian you must use `0x8040201008040201LL` – phuclv Oct 05 '14 at 12:58
6

Other answers contain an obvious loop implementation.

Here's a first variant:

unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
    result = (result<<1) + a[i];  /* note: this puts a[i] in bit 31-i; see chqrlie's comment below for the reversed loop */

On modern x86 CPUs, I think shifts of any distance in a register take constant time, and if so this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts: it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does the equivalent of roughly 500 1-bit shifts (the sum 0 + 1 + ... + 31 = 496), by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of differences in the comments; apparently long shifts on x86 are not unit time.)

Let us try something more radical.

Assume you can pack m booleans into an int somehow (trivially you can do this for m == 1), and that you have two variables i1 and i2 containing such m packed bits.

Then the following code packs m*2 booleans into an int:

 (i1<<m)+i2

Using this we can pack 2^n bits as follows:

 unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer

 a2[0]=(a1[0]<<1)+a1[1];  // the original bits are a1[k]; can be scalar variables or ints
 a2[1]=(a1[2]<<1)+a1[3];  //  yes, you can use "|" instead of "+"
 ...
 a2[15]=(a1[30]<<1)+a1[31];

 a4[0]=(a2[0]<<2)+a2[1];
 a4[1]=(a2[2]<<2)+a2[3];
 ...
 a4[7]=(a2[14]<<2)+a2[15];

 a8[0]=(a4[0]<<4)+a4[1];
 a8[1]=(a4[2]<<4)+a4[3];
 a8[2]=(a4[4]<<4)+a4[5];
 a8[3]=(a4[6]<<4)+a4[7];

 a16[0]=(a8[0]<<8)+a8[1];
 a16[1]=(a8[2]<<8)+a8[3];

 a32[0]=(a16[0]<<16)+a16[1];

Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 62 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits.)

On modern x86 CPUs, I think shifts of any distance in a register take constant time. If not, this code minimizes the cost of long-distance shifts; it in effect does 80 1-bit shifts' worth of shifting (16 bit-positions at each of the 5 levels).

On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler into registers, giving 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
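For reference, here is a compact loop formulation of the same pairwise combine for the question's unsigned int a[32] (my own sketch, not the answer's unrolled code); like the first loop above, it ends up with a[0] in the most significant bit:

static unsigned int pack32_tree(const unsigned int a[32])
{
    unsigned int t[32];
    for (unsigned i = 0; i < 32; ++i)
        t[i] = a[i] & 1u;                      /* level 0: one bit per word */
    /* each pass halves the number of words and doubles the bits per word */
    for (unsigned width = 1, n = 32; n > 1; width *= 2, n /= 2)
        for (unsigned i = 0; i < n; i += 2)
            t[i / 2] = (t[i] << width) | t[i + 1];
    return t[0];                               /* a[0] is in bit 31, a[31] in bit 0 */
}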

If the starting booleans are really in an array, with each occupying the bottom bit of an otherwise zeroed byte:

bool a1[32];

then we can abuse our knowledge of memory layout to fetch several at a time:

a4[0]=((unsigned int*)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int*)a1)[1];
...
a4[7]=((unsigned int*)a1)[7];

a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];

a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];

a32[0]=(a16[0]<<4)+a16[1];

Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
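A self-contained sketch of the same layout trick (my own formulation, assuming sizeof(bool) == 1 and a little-endian machine, and using memcpy to keep it strict-aliasing clean). Note that, as far as I can tell, this combine pattern leaves the bits in a stride-permuted order (a1[0], a1[4], a1[8], ... end up adjacent) rather than the exact order the question asks for, so a final reshuffle or a different combine pattern would be needed:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static uint32_t pack32_layout(const bool a1[32])
{
    uint32_t a4[8], a8[4], a16[2];
    memcpy(a4, a1, sizeof a4);   /* reinterpret the 32 bool bytes as eight 32-bit words */
    for (int i = 0; i < 4; ++i) a8[i]  = (a4[2*i] << 1) + a4[2*i + 1];
    for (int i = 0; i < 2; ++i) a16[i] = (a8[2*i] << 2) + a8[2*i + 1];
    return (a16[0] << 4) + a16[1];
}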

To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).

As always, these solutions need to be performance tested.

Ira Baxter
  • I ran this using the same (rough!) timing as my own attempt, and lo and behold: it *is* faster, nothing less than 3 times! (3583, in the units in my answer). But it's still 31 shifts and adds, begging the question ... *why*? It's only 1 shift+add less than the naive solution. – Jongware Oct 05 '14 at 10:16
  • Ira, this is a very nice answer, but in my question, `a` is an array of `unsigned int`s. Your solution is still valid but the text of the answer needs to be adapted accordingly, I'm afraid. – einpoklum Oct 05 '14 at 10:21
  • .. to put the 'loop' thing in perspective, a loop-less implementation of the naive solution (`b |= (a[0]<<31); b |= (a[1]<<30); ..` scores 5560 in my timing. So this is *still* faster! – Jongware Oct 05 '14 at 10:21
  • @einpoklum: Given your constraint of each bool being in an int by itself, you want my first variant. No edits to my answer needed that I see to address this. – Ira Baxter Oct 05 '14 at 10:24
  • @Jongware: I suggested that long-distance shifts are unit costs on x86. If they are not, that would make a big difference. You can try my variant loop that only does unit shifts to see if that matters. – Ira Baxter Oct 05 '14 at 10:26
  • Ira, I tested that as well -- I called this the "naive" solution, but "straightforward" may be a better name. It is the slowest of all suggestions. (Still only about 10% slower than my original answer.) – Jongware Oct 05 '14 at 10:36
  • @Jongware: how can it be slower than Galik's answer, which does long distance shifts? If shift costs are unit, it should be same as his solution. – Ira Baxter Oct 05 '14 at 10:38
  • Ira, apologies, I mixed it up with `result |= a[i]< – Jongware Oct 05 '14 at 10:45
  • In all modern x86 microarchs, shifts do have constant cost. However, there's often a difference between shifts by a constant and shifts by a variable (both have constant costs, but unequal to each other). – harold Oct 05 '14 at 11:27
  • `for(unsigned i = 0; i < 32; ++i) result = (result<<1) + a[i];` produces the mirror image of the OPs goal. You should use `for(unsigned i = 32; i-- > 0;) result += result + a[i];` – chqrlie Aug 06 '18 at 17:57
5

I would probably go for this:

#include <stdio.h>

unsigned a[32] =
{
    1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
    , 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};

int main()
{
    unsigned b = 0;

    for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
        b |= a[i] << i;

    printf("b: %u\n", b);
}

Compiler optimization may well unroll that, but just in case, you can always try:

int main()
{
    unsigned b = 0;

    b |= a[0];
    b |= a[1] << 1;
    b |= a[2] << 2;
    b |= a[3] << 3;
    // ... etc
    b |= a[31] << 31;

    printf("b: %u\n", b);
}
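If the compiler doesn't do it by itself, one way to shorten the dependency chain (a sketch of my own, along the lines harold suggests in the comments below) is to build four independent 8-bit groups and merge them at the end:

unsigned pack32_split(const unsigned a[32])
{
    unsigned b0 = 0, b1 = 0, b2 = 0, b3 = 0;
    /* four independent accumulators, so the ORs can overlap in the pipeline */
    for (unsigned i = 0; i < 8; ++i) {
        b0 |= a[i]      << i;
        b1 |= a[i +  8] << i;
        b2 |= a[i + 16] << i;
        b3 |= a[i + 24] << i;
    }
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
}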
Galik
  • Curious about the downvote (on both answers): is there a faster way? – Galik Oct 05 '14 at 09:13
  • I wasn't the downvote, but let's be serious now, OP already knows this. To avoid making it entirely trivial, at least break up the dependency chain into multiple parts or something like that. – harold Oct 05 '14 at 09:18
  • @harold I knew the OP probably knows this answer but I couldn't think of a faster way, and the answers are important for future readers. It's not just the OP that wants to know the answer to these questions. – Galik Oct 05 '14 at 09:24
3

To determine what the fastest way is, time all of the various suggestions. Here is one that may well end up as "the" fastest (using standard C, no processor-dependent SSE or the like):

unsigned int bits[32][2] = {
    {0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
    {0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
    {0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
    {0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
    {0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
    {0,0x800},{0,0x400},{0,0x200},{0,0x100},
    {0,0x80},{0,0x40},{0,0x20},{0,0x10},
    {0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (int i = 0; i < 32; i++)
    b |= bits[i][a[i]];

The first value in the array is to be the leftmost bit: the highest possible value. (Note that this packs a[0] into bit 31; reverse the table if you want a[0] in bit 0, as the question literally asks.)

Testing a proof-of-concept with some rough timings shows this is indeed not orders of magnitude better than the straightforward loop with b |= (a[i]<<(31-i)):

Ira                   3618 ticks
naive, unrolled       5620 ticks
Ira, 1-shifted       10044 ticks
Galik                10265 ticks
Jongware, using adds 12536 ticks
Jongware             12682 ticks
naive                13373 ticks

(Relative timings, with the same compiler options.)

(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)

Jongware
2
unsigned b=0;
for(int i=31; i>=0; --i){  /* walk from a[31] down so a[i] ends up in bit i */
    b<<=1;
    b|=a[i];
}
GingerPlusPlus
-1

Your problem is a good opportunity to use -->, also called the downto operator:

unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
    b += b + a[i];   /* b = 2*b + a[i]: a[0] ends up in bit 0, as required */
}

The advantage of using --> is it works with both signed and unsigned loop index types.

This approach is portable and readable. It might not produce the fastest code, but clang does unroll the loop and produces decent performance; see https://godbolt.org/g/6xgwLJ

chqrlie