10

I'm working on an x86 or x86_64 machine. I have an array `unsigned int a[32]`, all of whose elements have value either 0 or 1. I want to set the single variable `unsigned int b` so that `((b >> i) & 1) == a[i]` holds for all 32 elements of `a`. I'm working with GCC on Linux (shouldn't matter much, I guess).

What's the fastest way to do this in C?

einpoklum
  • This kind of loop is easily optimized by a decent compiler (possibly unrolled, possibly inlined, etc.). What were your measurements? – quantdev Oct 05 '14 at 08:50
  • Since you mention specific architectures, is it OK to use things such as SSE? – harold Oct 05 '14 at 09:05
  • The fastest way is to skip storing them as `unsigned int` in the first place. – tenfour Oct 05 '14 at 09:48
  • @tenfour: The fastest way is just knowing what the output of the entire program needs to be, and just write that out... – einpoklum Oct 05 '14 at 10:23
  • @einpoklum: It's quite practical to abstract your `uint` array into something that packs into a bitfield right from the start. What you said is nonsense. – tenfour Oct 05 '14 at 10:26
  • @tenfour: Are you asserting that bitfield access is as fast as scalar-word access? Because otherwise you've pushed the cost of accessing the booleans up in other parts of OP's code. – Ira Baxter Oct 05 '14 at 10:42
  • @tenfour: I don't control the format in which I'm getting the array. – einpoklum Oct 05 '14 at 10:50
  • THIS QUESTION IS NOT A DUPLICATE. It asks about the fastest possible way to pack a big set of booleans. The "duplicate" question asks about the most convenient way to pack a small set. You get completely different answers. – Ira Baxter Oct 07 '14 at 11:09
  • @ShafikYaghmour: it's not a dup. what are you doing?? – Karoly Horvath Oct 07 '14 at 11:28

7 Answers

11

The fastest way on recent x86 processors is probably to make use of the MOVMSK family of instructions (PMOVMSKB and friends), which extract the MSB of each lane of a SIMD register and pack them into a normal integer register.

I fear SIMD intrinsics are not really my thing, but something along these lines ought to work if you've got an AVX2-equipped processor:

#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
    return _mm256_movemask_epi8(tmp);
}

Assuming sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
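Since the question's input is actually unsigned int a[32] rather than bool, here is a sketch of the same idea for 32-bit elements (my own adaptation, untested, not part of the original answer): compare each lane against zero and let MOVMSKPS collect one bit per lane, eight lanes per 256-bit load.

#include <immintrin.h>
#include <stdint.h>

uint32_t bitpack_u32(const unsigned int a[32]) {
    uint32_t b = 0;
    for (int i = 0; i < 4; ++i) {
        __m256i v   = _mm256_loadu_si256((const __m256i *)(a + 8 * i));
        __m256i cmp = _mm256_cmpgt_epi32(v, _mm256_setzero_si256());
        /* movemask_ps grabs the sign bit of each 32-bit lane: 8 bits per load */
        b |= (uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(cmp)) << (8 * i);
    }
    return b;
}

Here a[i] ends up in bit i of b, which is the order the question asks for.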

doynax
  • @harold: Yeah, I just spotted and attempted to fix that issue. Sorry – doynax Oct 05 '14 at 09:41
  • Oh btw it seems that OP has wider input elements, could fix that with a couple of packs.. or leave as exercise for the reader of course :) – harold Oct 05 '14 at 09:51
  • Do you think you could add the appropriate `#IFDEF`s which ensure the relevant opcodes are available? – einpoklum Oct 05 '14 at 10:02
  • @harold: Ugh, you're right. Three PACKUSWB operations or the like ought to do the trick but let's hope the OP can switch types and has an AVX2 system, otherwise the resulting code gets a whole lot messier – doynax Oct 05 '14 at 10:03
  • @einpoklum: Don't quote me on this but I believe you can check `__AVX2__` in GCC (assuming -march=native/core-avx2). Hopefully you don't need to distribute a single binary which dynamically invokes code for the best available instruction set – doynax Oct 05 '14 at 10:15
  • Another point... what about making the writes non-temporal? I don't need those values cached. – einpoklum Oct 08 '14 at 19:26
  • @einpoklum: Are you referring to the writes producing the integer array? In that case I would look into any options for shrinking the data first. Playing with the cache in that way is likely to backfire unless you're very careful – doynax Oct 09 '14 at 10:58
  • @doynax: Yes, the writes producing the packed integer array. I'm not sure what you mean about shrinking the data - packing boolean values into single bits is the shrinking I'm doing, isn't it? – einpoklum Oct 09 '14 at 11:09
  • @einpoklum: If only a 0 or 1 is ever stored and have sufficient control over the writer to use non-temporal stores then I figured you might be able to emit bytes instead. Or emit small blocks and bit-pack more frequently while the cache is still hot. Taking manual control over the cache is basically an optimization of last resort – doynax Oct 09 '14 at 12:39
  • I think instead of comparing with 0, you can also shift the lsb left to move it to the most significant bit to use with `_mm256_movemask_epi8` – phuclv Mar 06 '18 at 17:28
7

If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using the technique discussed here, on a computer with fast multiplication, like this:

#include <stdbool.h>
#include <stdint.h>

inline int pack8b(bool* a)
{
    uint64_t t = *((uint64_t*)a);   /* requires the array to be 8-byte aligned */
    /* note: this puts a[0] in the most significant bit of the returned byte */
    return (0x8040201008040201ULL*t >> 56) & 0xFF;
}

int pack32b(bool* a)
{
    return (pack8b(a +  0) << 24) | (pack8b(a +  8) << 16) |
           (pack8b(a + 16) <<  8) | (pack8b(a + 24) <<  0);
}

Explanation:

Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Treating those 8 consecutive bools as one 64-bit word and loading it, on a little-endian machine we get the bits in reversed order. Now we do a multiplication (here dots are zero bits):

  |  a7  ||  a6  ||  a5  ||  a4  ||  a3  ||  a2  ||  a1  ||  a0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
  ↑..d.....↑.c......↑b.......a
  ↑.c......↑b.......a
  ↑b.......a
  a       
  ────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The arrows are added so it's easier to see the positions of the set bits in the magic number. At this point the 8 least significant bits have been put in the top byte; we just need to shift it down and mask the remaining bits out.

So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001, i.e. 0x8040201008040201, we get the code above.
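As a quick sanity check (my own arithmetic, not part of the original answer): if only a[0] is 1 then the loaded word is t = 1, so the product is just the magic constant, and (0x8040201008040201 >> 56) & 0xFF = 0x80, i.e. a[0] lands in the most significant bit of the packed byte.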

Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it further, for example by shifting only once instead of shifting by 56 bits in each pack8b call.


Sorry, I overlooked the question: I saw doynax's bool array, misread "32 0/1 values", and thought they were 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or other distributions of bits) at the same time, but it's a lot less efficient than packing bytes.

On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with

_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);

And to pack 2 of the question's uint32_t values at a time, use

_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
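Putting that together for the question's full unsigned int a[32] might look like this (my own sketch, not from the original answer; requires BMI2, e.g. compile with -mbmi2, and uses memcpy to avoid the aliasing cast):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

uint32_t pack32_pext(const unsigned int a[32])
{
    uint32_t b = 0;
    for (int i = 0; i < 32; i += 2) {
        uint64_t t;
        memcpy(&t, a + i, sizeof t);  /* two 32-bit elements per 64-bit word */
        b |= (uint32_t)_pext_u64(t, (1ULL << 32) | 1ULL) << i;
    }
    return b;
}

This keeps a[i] in bit i of b, matching the question's required order.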
phuclv
  • Nobody said anything about `bool`s... but your suggestion can be adapted to unsigned ints, to some extent. – einpoklum Oct 05 '14 at 10:51
  • It is fast, but I don't get the expected result... Can you try with the set of numbers in Galik's answer? I have been using that as a benchmark. (I already tried with `LL` at the end of the long constant.) ((Assuming a 1-byte source array.)) – Jongware Oct 05 '14 at 10:53
  • This is a nice scheme if your processor does big multiplies fast. Your processor may not have that property. – Ira Baxter Oct 05 '14 at 10:59
  • oops I misread the bool in the original title. It can be adapted to unsigned ints but then you can only compact 2 bits in one step, and it will be much slower than other solutions – phuclv Oct 05 '14 at 11:54
  • @Jongware this is because the magic number above is for big endian. For little endian you must use `0x8040201008040201LL` – phuclv Oct 05 '14 at 12:58
6

Other answers contain an obvious loop implementation.

Here's a first variant:

unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
    result = (result<<1) + a[i];  /* note: this puts a[i] in bit 31-i; see chqrlie's comment below for the reversed loop */

On modern x86 CPUs, I think shifts of any distance in a register take constant time, and if so this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts: it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does the equivalent of roughly 500 1-bit shifts (the sum 0 + 1 + ... + 31 = 496), by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of differences in the comments; apparently long shifts on x86 are not unit time.)

Let us try something more radical.

Assume you can pack m booleans into an int somehow (trivially you can do this for m == 1), and that you have two variables i1 and i2 containing such m packed bits.

Then the following code packs m*2 booleans into an int:

 (i1<<m)+i2

Using this we can pack 2^n bits as follows:

 unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer

 a2[0]=(a1[0]<<1)+a1[1];  // the original bits are a1[k]; can be scalar variables or ints
 a2[1]=(a1[2]<<1)+a1[3];  //  yes, you can use "|" instead of "+"
 ...
 a2[15]=(a1[30]<<1)+a1[31];

 a4[0]=(a2[0]<<2)+a2[1];
 a4[1]=(a2[2]<<2)+a2[3];
 ...
 a4[7]=(a2[14]<<2)+a2[15];

 a8[0]=(a4[0]<<4)+a4[1];
 a8[1]=(a4[2]<<4)+a4[3];
 a8[2]=(a4[4]<<4)+a4[5];
 a8[3]=(a4[6]<<4)+a4[7];

 a16[0]=(a8[0]<<8)+a8[1];
 a16[1]=(a8[2]<<8)+a8[3];

 a32[0]=(a16[0]<<16)+a16[1];

Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 62 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits.)

On modern x86 CPUs, I think shifts of any distance in a register take constant time. If not, this code minimizes the cost of long-distance shifts; it in effect does 80 1-bit shifts' worth of shifting (16 bit-positions at each of the 5 levels).

On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler into registers, giving 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
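For reference, here is a compact loop formulation of the same pairwise combine for the question's unsigned int a[32] (my own sketch, not the answer's unrolled code); like the first loop above, it ends up with a[0] in the most significant bit:

static unsigned int pack32_tree(const unsigned int a[32])
{
    unsigned int t[32];
    for (unsigned i = 0; i < 32; ++i)
        t[i] = a[i] & 1u;                      /* level 0: one bit per word */
    /* each pass halves the number of words and doubles the bits per word */
    for (unsigned width = 1, n = 32; n > 1; width *= 2, n /= 2)
        for (unsigned i = 0; i < n; i += 2)
            t[i / 2] = (t[i] << width) | t[i + 1];
    return t[0];                               /* a[0] is in bit 31, a[31] in bit 0 */
}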

If the starting booleans are really in an array, with each occupying the bottom bit of an otherwise zeroed byte:

bool a1[32];

then we can abuse our knowledge of memory layout to fetch several at a time:

a4[0]=((unsigned int*)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int*)a1)[1];
...
a4[7]=((unsigned int*)a1)[7];

a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];

a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];

a32[0]=(a16[0]<<4)+a16[1];

Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
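A self-contained sketch of the same layout trick (my own formulation, assuming sizeof(bool) == 1 and a little-endian machine, and using memcpy to keep it strict-aliasing clean). Note that, as far as I can tell, this combine pattern leaves the bits in a stride-permuted order (a1[0], a1[4], a1[8], ... end up adjacent) rather than the exact order the question asks for, so a final reshuffle or a different combine pattern would be needed:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static uint32_t pack32_layout(const bool a1[32])
{
    uint32_t a4[8], a8[4], a16[2];
    memcpy(a4, a1, sizeof a4);   /* reinterpret the 32 bool bytes as eight 32-bit words */
    for (int i = 0; i < 4; ++i) a8[i]  = (a4[2*i] << 1) + a4[2*i + 1];
    for (int i = 0; i < 2; ++i) a16[i] = (a8[2*i] << 2) + a8[2*i + 1];
    return (a16[0] << 4) + a16[1];
}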

To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).

As always, these solutions need to be performance tested.

Ira Baxter
  • I ran this using the same (rough!) timing as my own attempt, and lo and behold: it *is* faster, nothing less than 3 times! (3583, in the units in my answer). But it's still 31 shifts and adds, begging the question ... *why*? It's only 1 shift+add less than the naive solution. – Jongware Oct 05 '14 at 10:16
  • Ira, this is a very nice answer, but in my question, `a` is an array of `unsigned int`s. Your solution is still valid but the text of the answer needs to be adapted accordingly, I'm afraid. – einpoklum Oct 05 '14 at 10:21
  • .. to put the 'loop' thing in perspective, a loop-less implementation of the naive solution (`b |= (a[0]<<31); b |= (a[1]<<30); ..` scores 5560 in my timing. So this is *still* faster! – Jongware Oct 05 '14 at 10:21
  • @einpoklum: Given your constraint of each bool being in an int by itself, you want my first variant. No edits to my answer needed that I see to address this. – Ira Baxter Oct 05 '14 at 10:24
  • @Jongware: I suggested that long-distance shifts are unit costs on x86. If they are not, that would make a big difference. You can try my variant loop that only does unit shifts to see if that matters. – Ira Baxter Oct 05 '14 at 10:26
  • Ira, I tested that as well -- I called this the "naive" solution, but "straightforward" may be a better name. It is the slowest of all suggestions. (Still only about 10% slower than my original answer.) – Jongware Oct 05 '14 at 10:36
  • @Jongware: how can it be slower than Galik's answer, which does long distance shifts? If shift costs are unit, it should be same as his solution. – Ira Baxter Oct 05 '14 at 10:38
  • Ira, apologies, I mixed it up with `result |= a[i]< – Jongware Oct 05 '14 at 10:45
  • In all modern x86 microarchs, shifts do have constant cost. However, there's often a difference between shifts by a constant and shifts by a variable (both have constant costs, but unequal to each other). – harold Oct 05 '14 at 11:27
  • `for(unsigned i = 0; i < 32; ++i) result = (result<<1) + a[i];` produces the mirror image of the OPs goal. You should use `for(unsigned i = 32; i-- > 0;) result += result + a[i];` – chqrlie Aug 06 '18 at 17:57
5

I would probably go for this:

#include <stdio.h>

unsigned a[32] =
{
    1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
    , 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};

int main()
{
    unsigned b = 0;

    for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
        b |= a[i] << i;

    printf("b: %u\n", b);
}

Compiler optimization may well unroll that, but just in case, you can always try:

int main()
{
    unsigned b = 0;

    b |= a[0];
    b |= a[1] << 1;
    b |= a[2] << 2;
    b |= a[3] << 3;
    // ... etc
    b |= a[31] << 31;

    printf("b: %u\n", b);
}
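If the compiler doesn't do it by itself, one way to shorten the dependency chain (a sketch of my own, along the lines harold suggests in the comments below) is to build four independent 8-bit groups and merge them at the end:

unsigned pack32_split(const unsigned a[32])
{
    unsigned b0 = 0, b1 = 0, b2 = 0, b3 = 0;
    /* four independent accumulators, so the ORs can overlap in the pipeline */
    for (unsigned i = 0; i < 8; ++i) {
        b0 |= a[i]      << i;
        b1 |= a[i +  8] << i;
        b2 |= a[i + 16] << i;
        b3 |= a[i + 24] << i;
    }
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
}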
Galik
  • Curious about the downvote (on both answers): is there a faster way? – Galik Oct 05 '14 at 09:13
  • I wasn't the downvote, but let's be serious now, OP already knows this. To avoid making it entirely trivial, at least break up the dependency chain into multiple parts or something like that. – harold Oct 05 '14 at 09:18
  • @harold I knew the OP probably knows this answer but I couldn't think of a faster way, and the answers are important for future readers. It's not just the OP that wants to know the answer to these questions. – Galik Oct 05 '14 at 09:24
3

To determine what the fastest way is, time all of the various suggestions. Here is one that may well end up as "the" fastest (using standard C, no processor-dependent SSE or the like):

unsigned int bits[32][2] = {
    {0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
    {0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
    {0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
    {0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
    {0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
    {0,0x800},{0,0x400},{0,0x200},{0,0x100},
    {0,0x80},{0,0x40},{0,0x20},{0,0x10},
    {0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (int i = 0; i < 32; i++)
    b |= bits[i][a[i]];

The first value in the array is to be the leftmost bit: the highest possible value. (Note that this packs a[0] into bit 31; reverse the table if you want a[0] in bit 0, as the question literally asks.)

Testing a proof-of-concept with some rough timings shows this is indeed not orders of magnitude better than the straightforward loop with b |= (a[i]<<(31-i)):

Ira                   3618 ticks
naive, unrolled       5620 ticks
Ira, 1-shifted       10044 ticks
Galik                10265 ticks
Jongware, using adds 12536 ticks
Jongware             12682 ticks
naive                13373 ticks

(Relative timings, with the same compiler options.)

(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)

Jongware
2
unsigned b=0;
for(int i=31; i>=0; --i){  /* walk from a[31] down so a[i] ends up in bit i */
    b<<=1;
    b|=a[i];
}
GingerPlusPlus
-1

Your problem is a good opportunity to use -->, also called the downto operator:

unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
    b += b + a[i];   /* b = 2*b + a[i]: a[0] ends up in bit 0, as required */
}

The advantage of using --> is it works with both signed and unsigned loop index types.

This approach is portable and readable. It might not produce the fastest code, but clang does unroll the loop and produces decent performance; see https://godbolt.org/g/6xgwLJ

chqrlie