44

I need a fast way to get the position of all one bits in a 64-bit integer. For example, given x = 123703, I'd like to fill an array idx[] = {0, 1, 2, 4, 5, 8, 9, 13, 14, 15, 16}. We can assume we know the number of bits a priori. This will be called 10^12 - 10^15 times, so speed is of the essence. The fastest answer I've come up with so far is the following monstrosity, which uses each byte of the 64-bit integer as an index into tables that give the number of bits set in that byte and the positions of the ones:

int64_t x;            // this is the input
unsigned char idx[K]; // this is the array of K bits that are set
unsigned char *dst=idx, *src;
unsigned char zero, one, two, three, four, five;  // these hold the 0th-5th bytes
zero  =  x & 0x0000000000FFUL;
one   = (x & 0x00000000FF00UL) >> 8;
two   = (x & 0x000000FF0000UL) >> 16;
three = (x & 0x0000FF000000UL) >> 24;
four  = (x & 0x00FF00000000UL) >> 32;
five  = (x & 0xFF0000000000UL) >> 40;
src=tab0+tabofs[zero ]; COPY(dst, src, n[zero ]);
src=tab1+tabofs[one  ]; COPY(dst, src, n[one  ]);
src=tab2+tabofs[two  ]; COPY(dst, src, n[two  ]);
src=tab3+tabofs[three]; COPY(dst, src, n[three]);
src=tab4+tabofs[four ]; COPY(dst, src, n[four ]);
src=tab5+tabofs[five ]; COPY(dst, src, n[five ]);

where COPY is a switch statement to copy up to 8 bytes, n is an array giving the number of bits set in each byte value, and tabofs gives the offset into tabX, which holds the positions of the set bits in the X-th byte. This is about 3x faster than unrolled loop-based methods with __builtin_ctz() on my Xeon E5-2609. (See below.) I am currently iterating x in lexicographical order for a given number of bits set.
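For reference, such tables can be generated once at startup. A sketch of how they might be built (my reconstruction, not the code from the pastebin; a single tab[] of 0-based positions stands in for tab0-tab5, which would hold the same lists with 8*X added to each position):

#include <stdint.h>

static unsigned char  n[256];        /* popcount of each byte value          */
static unsigned short tabofs[256];   /* where each byte's list starts in tab */
static unsigned char  tab[256 * 8];  /* concatenated lists of bit positions  */

static void init_tables(void) {
    unsigned short ofs = 0;
    for (int b = 0; b < 256; b++) {
        tabofs[b] = ofs;
        for (int bit = 0; bit < 8; bit++)
            if (b & (1 << bit)) {
                tab[ofs++] = (unsigned char)bit; /* 0-based position in byte */
                n[b]++;
            }
    }
}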

Is there a better way?

EDIT: Added an example (that I have subsequently fixed). Full code is available here: http://pastebin.com/79X8XL2P. Note: GCC with -O2 seems to optimize it away, but Intel's compiler (which I used to compose it) doesn't...

Also, let me give some additional background to address some of the comments below. The goal is to perform a statistical test on every possible subset of K variables out of a universe of N possible explanatory variables; the specific target right now is N=41, but I can see some projects needing N up to 45-50. The test basically involves factorizing the corresponding data submatrix. In pseudocode, something like this:

double doTest(double *data, int64_t model) {
  int nidx, idx[];
  double submatrix[][];
  nidx = getIndices(model, idx);  // get the locations of ones in model
  // copy data into submatrix
  for(int i=0; i<nidx; i++) {
    for(int j=0; j<nidx; j++) {
      submatrix[i][j] = data[idx[i]][idx[j]];
    }
  }
  factorize(submatrix, nidx);
  return the_answer;
}

I coded up a version of this for an Intel Xeon Phi board that should complete the N=41 case in about 15 days, of which ~5-10% of the time is spent in a naive getIndices(), so right off the bat a faster version could save a day or more. I'm working on an implementation for NVIDIA Kepler too, but unfortunately the problem I have (ludicrous numbers of small matrix operations) is not ideally suited to the hardware (ludicrously large matrix operations). That said, this paper presents a solution that seems to achieve hundreds of GFLOPS on matrices of my size by aggressively unrolling loops and performing the entire factorization in registers, with the caveat that the dimensions of the matrix be defined at compile time. (This loop unrolling should help reduce overhead and improve vectorization in the Phi version too, so getIndices() will become more important!) So now I'm thinking my kernel should look more like:

double *data;  // move data to GPU/Phi once into shared memory
template<unsigned int K> double doTestUnrolled(int *idx) {
  double submatrix[K][K];
  // copy data into submatrix
  #pragma unroll
  for(int i=0; i<K; i++) {
    #pragma unroll
    for(int j=0; j<K; j++) {
      submatrix[i][j] = data[idx[i]][idx[j]];
    }
  }
  factorizeUnrolled<K>(submatrix);
  return the_answer;
}
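Since K is only known at run time, instantiating the template presumably requires a dispatch step somewhere; a minimal sketch of what that might look like (my illustration, not the author's code):

double dispatch(unsigned int K, int *idx) {
    switch (K) {
    case 1:  return doTestUnrolled<1>(idx);
    case 2:  return doTestUnrolled<2>(idx);
    // ... one case per K ...
    case 41: return doTestUnrolled<41>(idx);
    default: return 0.0;  // unreachable for valid K
    }
}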

The Phi version solves each model in a cilk_for loop from model=0 to 2^N (or, rather, a subset for testing), but now, in order to batch work for the GPU and amortize the kernel launch overhead, I have to iterate model numbers in lexicographical order for each of K=1 to 41 bits set (as doynax noted).
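For the iteration itself, a standard tool is Gosper's hack (HAKMEM item 175), which yields the next integer with the same popcount; a sketch, for comparison only, since the pastebin's iteration code may differ:

#include <stdint.h>

/* Next value after v with the same number of set bits (requires v != 0). */
uint64_t next_same_popcount(uint64_t v) {
    uint64_t c = v & -v;              /* lowest set bit                     */
    uint64_t r = v + c;               /* ripple carry through the low block */
    return (((v ^ r) >> 2) / c) | r;  /* move surplus bits to the bottom    */
}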

EDIT 2: Now that vacation is over, here are some results on my Xeon E5-2602 using icc version 15. The code that I used to benchmark is here: http://pastebin.com/XvrGQUat. I perform the bit extraction on integers that have exactly K bits set, so there is some overhead for the lexicographic iteration, measured in the "Base" column in the table below. These are performed 2^30 times with N=48 (repeating as necessary).

"CTZ" is a loop that uses the the gcc intrinsic __builtin_ctzll to get the lowest order bit set:

uint64_t tmp = x, lb;   // working copy and lowest-bit scratch
for(int i=0; i<K; i++) {
    idx[i] = __builtin_ctzll(tmp);
    lb = tmp & -tmp;    // isolate lowest set bit
    tmp ^= lb;          // remove lowest set bit from tmp
}

"Mark" is Mark Ransom's branchless loop from his answer below:

for(int i=0; i<K; i++) {
    *dst = i;
    dst += x & 1;
    x >>= 1;
} 

Tab1 is my original table-based code with the following copy macro:

#define COPY(d, s, n) \
switch(n) { \
case 8: *(d++) = *(s++); \
case 7: *(d++) = *(s++); \
case 6: *(d++) = *(s++); \
case 5: *(d++) = *(s++); \
case 4: *(d++) = *(s++); \
case 3: *(d++) = *(s++); \
case 2: *(d++) = *(s++); \
case 1: *(d++) = *(s++); \
case 0: break;        \
}

Tab2 is the same code as Tab1, but the copy macro just moves 8 bytes as a single copy (taking ideas from doynax and Lưu Vĩnh Phúc... but note this does not ensure alignment):

#define COPY2(d, s, n) { *((uint64_t *)d) = *((uint64_t *)s); d+=n; }
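A strict-aliasing- and alignment-safe variant would route the 8-byte move through memcpy, which compilers typically lower to a single unaligned load/store; a hedged alternative (not one of the benchmarked versions, and like COPY2 it still writes a full 8 bytes, so the destination needs slack):

#include <string.h>

#define COPY3(d, s, n) { memcpy((d), (s), 8); (d) += (n); }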

Here are the results. I guess my initial claim that Tab1 is 3x faster than CTZ only holds for large K (where I was testing). Mark's loop is faster than my original code, but getting rid of the branch in the COPY2 macro takes the cake for K > 8.

 K      Base      CTZ     Mark     Tab1     Tab2
001    4.97s    6.42s    6.66s   18.23s   12.77s
002    4.95s    8.49s    7.28s   19.50s   12.33s
004    4.95s    9.83s    8.68s   19.74s   11.92s
006    4.95s   16.86s    9.53s   20.48s   11.66s
008    4.95s   19.21s   13.87s   20.77s   11.92s
010    4.95s   21.53s   13.09s   21.02s   11.28s
015    4.95s   32.64s   17.75s   23.30s   10.98s
020    4.99s   42.00s   21.75s   27.15s   10.96s
030    5.00s  100.64s   35.48s   35.84s   11.07s
040    5.01s  131.96s   44.55s   44.51s   11.58s
Andrew
    Out of curiosity, what sort of applications does this have? – Cameron Dec 20 '13 at 22:39
  • "We can assume we know the number of bits a priori." Is there a (small) upper bound, or can it happen that (nearly) all bits are set? – Thomas Dec 20 '13 at 22:42
  • Instead of having a lookup table on bytes, you could try having one on 16-bit shorts instead. – Thomas Dec 20 '13 at 22:43
  • @cameron: Finding the best statistical model of a dependent variable given N possible independent variables so we need to try all 2^N models (N=40-50). This code is just enumerating the models to send to GPUs to solve. – Andrew Dec 20 '13 at 22:44
  • @Thomas: Upper bound is 50, realistically. – Andrew Dec 20 '13 at 22:45
  • 3
    Isn't trying a single model far more expensive than computing this bit thingy? Or do you have some impressive computing farm at your disposal? I mean, merely counting to 10^12 takes the average computer quite a while, let alone 10^15... – Thomas Dec 20 '13 at 22:47
  • 9
I don't much like those (presumably unpredictable) branches for entering the unrolled copying loops. Have you tried just blasting off a full eight bytes and advancing dst by the count from the table? Also perhaps you could post the generated assembly for us to peruse as well? We're well into micro-optimization territory and it's easy to make unwarranted assumptions about code generation – doynax Dec 20 '13 at 22:48
What does your output look like? I am having trouble seeing how COPY is selecting actual bits, when n[] just returns the number of bits set in its index. Haven't you lost the info on which bits exist after doing that? – RichardPlunkett Dec 20 '13 at 23:09
  • 2
    You know the code you posted only works on 48 of the 64 bits, right? If your limit is 50 bits then you're 2 short. – Mark Ransom Dec 21 '13 at 00:01
  • 1
    Do you have all 10^15 qwords to start with? What format do you want the output in? Graphics card? Number of processors? Odds are this is going to be rather memory io bound. – Yakk - Adam Nevraumont Dec 21 '13 at 00:26
  • I see the comment that answers some of the above: so you really want the output, not the conversion? – Yakk - Adam Nevraumont Dec 21 '13 at 00:28
  • can you post entire code online, i cant figure out what you want or how you are calculating it atm – NoSenseEtAl Dec 21 '13 at 00:54
  • An example or 2 would help. – chux - Reinstate Monica Dec 21 '13 at 01:33
  • @Thomas: Well, right now it's not a big server farm but depending on budgets I could have several more GPUs at my disposal next year for bigger problems such that the CPU will have trouble keeping up. But honestly, after the first couple optimizations that took the N=40 case down from a couple days to a couple hours, I'm hooked and want to see how fast it can go ...:) – Andrew Dec 21 '13 at 01:37
  • @MarkRansom: Yes, the initial problem has 41 variables and I got tired of cut-and-pasting... when someone actually has a 50-variable problem I'll add the last line. – Andrew Dec 21 '13 at 02:13
  • Well, the fastest way is to have a 184 quadrillion entry table. Failing that you can break it into smaller table several ways, depending on your storage "budget", and how much time you're willing to devote to initializing the tables. – Hot Licks Dec 21 '13 at 02:21
  • 4
I'm wondering how having a list of bits set is more useful than having the "int64_t x" with those bits set? Also, knowing your dataset could help a lot for optimizations (for example: are most inputs 0?)... and with 10^12 calls optimization may be valuable. Also, have you looked at (man -k bit) ffs(3), ffs(3p), ffsl(3), ffsll(3)? They may be using the CPU instructions designed for this... and C won't be able to match a hand-coded assembly inline function or macro. – 9mjb Dec 21 '13 at 13:40
  • 2
    Have you tried sending the masks to the GPU and have the GPU do the unpacking? Whatever you're doing with those indexes, you probably don't *really* need them as a list.. – harold Dec 21 '13 at 14:08
  • Yeah, the mask is a much more compact form than a list and easier to pass around/manage. The time required to shift a mask and test a bit is trivial compared to the cost of passing around a large, variable-length list. – Hot Licks Dec 21 '13 at 18:57
  • I did not test this... but @Dr.Kameleon asks "Get an array of the bit positions within a 64-bit integer" and a more efficient answer in http://stackoverflow.com/questions/14086854/get-an-array-of-the-bit-positions-within-a-64-bit-integer – 9mjb Dec 24 '13 at 17:09
  • Depending on how desperate you are, consider using an FPGA. Bit fiddling costs you nothing there. – maxy Dec 25 '13 at 08:59
@Andrew a clarification required: suppose I run your code to obtain the indexes in a loop 2^32 times, how long would it take in your implementation? – Koushik Shetty Dec 29 '13 at 07:48
  • @koushik Now that I'm back from vacation I will post timing results over the weekend – Andrew Jan 02 '14 at 22:12
  • @Andrew yeah that will help in comparing my results also. then i can post based on that. – Koushik Shetty Jan 03 '14 at 05:10
@koushik I've updated with results of a (limited) bake-off – Andrew Jan 07 '14 at 00:50
  • If you use the lowest bits first you can use v&=v-1 to clear them. Something like: for (; v; v &= v-1) { nextbit = __builtin_ctz(v); } – Marcelo Pacheco Oct 26 '20 at 14:53

11 Answers

7

I believe the key to performance here is to focus on the larger problem rather than on micro-optimizing the extraction of bit positions out of a random integer.

Judging by your sample code and previous SO question you are enumerating all words with K bits set in order, and extracting the bit indices out of these. This greatly simplifies matters.

If so, then instead of rebuilding the bit positions from scratch each iteration, try directly incrementing them in the position array. Half of the time this involves just a single loop iteration and increment.

Something along these lines:

// Walk through all len-bit words with num-bits set in order
void enumerate(size_t num, size_t len) {
    size_t i;
    unsigned int bitpos[64 + 1];

    // Seed with the lowest word plus a sentinel
    for(i = 0; i < num; ++i)
        bitpos[i] = i;
    bitpos[i] = 0;

    // Here goes the main loop
    do {
        // Do something with the resulting data
        process(bitpos, num);

        // Increment the least-significant series of consecutive bits
        for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
            bitpos[i] = i;
    // Stop on reaching the top
    } while(++bitpos[i] != len);
}

// Test function
void process(const unsigned int *bits, size_t num) {
    do
        printf("%d ", bits[--num]);
    while(num);
    putchar('\n');
}

Not particularly optimized but you get the general idea.
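As a worked example: enumerate(2, 4) visits the C(4,2) = 6 index sets in the order {0,1}, {0,2}, {1,2}, {0,3}, {1,3}, {2,3} — i.e. the corresponding words 3, 5, 6, 9, 10, 12 in increasing numeric order, matching the question's enumeration of words with K bits set.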

doynax
  • Your point is well-taken (even though I didn't put it in the bake-off update...) – Andrew Jan 06 '14 at 23:23
  • Aww.. You've got me on tenterhooks. It may be cheating but at least post the numbers separately. Anyway, as you may have guessed by now the _real_ key is to take this one step further and start looking at better (search?) algorithms to avoid needing to brute-force evaluate all permutations in the first place – doynax Jan 06 '14 at 23:32
  • Sure I'll post numbers tomorrow when I'm back at work. Unfortunately the theory dictates that we do the sum over the entire power set (anything less is an approximation) and the whole purpose of this exercise is to quantify the impact of "better" algorithms that people have come up with... – Andrew Jan 07 '14 at 00:45
6

Here's something very simple which might be faster - no way to know without testing. Much will depend on the number of bits set vs. the number unset. You could unroll this to remove branching altogether, but with today's processors I don't know if it would speed things up at all.

unsigned char idx[K+1]; // need one extra for overwrite protection
unsigned char *dst=idx;
for (unsigned char i = 0; i < 50; i++)
{
    *dst = i;
    dst += x & 1;
    x >>= 1;
}

P.S. your sample output in the question is wrong, see http://ideone.com/2o032E

Mark Ransom
  • Thanks for catching the example error, I have no idea what I was thinking (or not, rather). I will compare on Monday with Intel compiler. – Andrew Dec 21 '13 at 03:29
  • Completely unrolling the loop should help. If the bit-patterns are quite sparse, the following alternative may work well if a fast count-of-trailing-zeros operation ctz() is available: `int i = 0; while (x) { i += ctz(x); *dst = i; i++; dst++; x >>= i; }`. Along similar lines, a parallel solution based on population count, popc(), seems possible for example if implemented on a GPU where each thread can work on one input bit and write the appropriate element of dst[]. This is basically a stream compaction problem. – njuffa Dec 21 '13 at 09:58
  • Sorry, that should have been `int s, i = 0; while (x) { s = ctz(x); i += s; *dst = i; dst++; i++; s++; x >>= s; }` – njuffa Dec 21 '13 at 10:52
  • Wow... your unrolled branchless loop beats ctz even for small numbers of set bits. Cool! – Andrew Jan 07 '14 at 00:50
3

As a minimal modification:

int64_t x;            // this is the input
char idx[K+1];
char *dst = idx;
const int BITS = 8;
for (int i = 0; i < 64; i += BITS) {
  int y = (int)(x & ((1 << BITS) - 1));
  const char* src = tab[y];          // tab[y] is a _string_
  char* end = dst + strlen(src);
  memcpy(dst, src, (size_t)(end - dst));
  for (; dst != end; ++dst)
  {
    *dst += (i - 1); // tab[] is null-terminated so bit positions are 1 to BITS.
  }
  x >>= BITS;
}

The choice of BITS determines the size of the table. 8, 13, and 16 are logical choices. Each entry is a zero-terminated string containing the bit positions with a 1-based offset; e.g. tab[5] is "\x03\x01". The inner loop fixes up this offset.
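A sketch of how such a string table might be generated for BITS = 8 (my illustration; the answer doesn't include the generator):

static char tab_storage[256][9];   /* up to 8 positions plus the NUL */
static const char *tab[256];

static void build_tab(void) {
    for (int b = 0; b < 256; b++) {
        int k = 0;
        for (int bit = 7; bit >= 0; bit--)  /* highest first: tab[5] = "\x03\x01" */
            if (b & (1 << bit))
                tab_storage[b][k++] = (char)(bit + 1);  /* 1-based offset */
        tab_storage[b][k] = '\0';
        tab[b] = tab_storage[b];
    }
}

Storing positions highest-first is also what makes the suffix-sharing observation below possible.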

Slightly more efficient: replace the strcat and inner loop by

char const* ptr = tab[y];
while (*ptr)
{
   *dst++ = *ptr++ + (i-1);
}

Loop unrolling can be a bit of a pain if the loop contains branches, because copying those branch statements doesn't help the branch predictor. I'll happily leave that decision to the compiler.

One thing I'm considering is that tab[y] is an array of pointers to strings. These are highly similar: "\x1" is a suffix of "\x3\x1". In fact, each string which doesn't start with "\x8" is a suffix of a string which does. I'm wondering how many unique strings you need, and to what degree tab[y] is in fact needed. E.g. by the logic above, tab[128+x] == tab[x]-1.

[edit]

Nevermind, you definitely need 128 tab entries starting with "\x8" since they're never the suffix of another string. Still, the tab[128+x] == tab[x]-1 rule means that you can save half the entries, but at the cost of two extra instructions: char const* ptr = tab[x & 0x7F] - ((x>>7) & 1). (Set up tab[] to point after the \x8)

MSalters
2

Using char won't increase speed; in fact, it often requires extra masking and sign/zero extension during calculation. Smaller integer types only pay off for very large arrays that need to fit in cache.

Another thing you can improve is the COPY macro. Instead of copying byte by byte, copy a whole word when possible:

inline void COPY(unsigned char *dst, unsigned char *src, int n)
{
    switch(n) { // remember to align dst and src when declaring
    case 8:
        *((int64_t*)dst) = *((int64_t*)src);
        break;
    case 7:
        *((int32_t*)dst) = *((int32_t*)src);
        *((int16_t*)(dst + 4)) = *((int16_t*)(src + 4));
        dst[6] = src[6];
        break;
    case 6:
        *((int32_t*)dst) = *((int32_t*)src);
        *((int16_t*)(dst + 4)) = *((int16_t*)(src + 4));
        break;
    case 5:
        *((int32_t*)dst) = *((int32_t*)src);
        dst[4] = src[4];
        break;
    case 4:
        *((int32_t*)dst) = *((int32_t*)src);
        break;
    case 3:
        *((int16_t*)dst) = *((int16_t*)src);
        dst[2] = src[2];
        break;
    case 2:
        *((int16_t*)dst) = *((int16_t*)src);
        break;
    case 1:
        dst[0] = src[0];
        break;
    case 0:
        break;
    }
}

Also, since tabofs[x] and n[x] are usually accessed together, pack them next to each other in memory to make sure they are always in cache at the same time:

struct TAB_N
{
    int16_t n, tabofs;
} tab_n[256];

src=tab0+tab_n[b0].tabofs; COPY(dst, src, tab_n[b0].n);
src=tab1+tab_n[b1].tabofs; COPY(dst, src, tab_n[b1].n);
src=tab2+tab_n[b2].tabofs; COPY(dst, src, tab_n[b2].n);
src=tab3+tab_n[b3].tabofs; COPY(dst, src, tab_n[b3].n);
src=tab4+tab_n[b4].tabofs; COPY(dst, src, tab_n[b4].n);
src=tab5+tab_n[b5].tabofs; COPY(dst, src, tab_n[b5].n);

Last but not least, gettimeofday is not meant for performance measurement. Use QueryPerformanceCounter instead; it's much more precise.
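QueryPerformanceCounter is Windows-only; on the Linux/icc setup from the question, the closest equivalent would be clock_gettime with CLOCK_MONOTONIC. A minimal sketch:

#include <time.h>

static double seconds_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);  /* unaffected by wall-clock jumps */
    return (double)ts.tv_sec + ts.tv_nsec * 1e-9;
}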

phuclv
1

Your code is using a 1-byte (256-entry) index table. You can speed it up by a factor of 2 if you use a 2-byte (65536-entry) index table.

Unfortunately, you probably cannot extend that much further: for a 3-byte index the table would be 16MB, unlikely to fit in the CPU caches, and it would only make things slower.
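A sketch of what the 16-bit lookup could look like, following the per-chunk scheme from the question (tab16_0..tab16_3, n16 and tabofs16 are hypothetical names; each tab16_X would hold positions pre-offset by 16*X, and a 16-bit chunk can contribute up to 16 positions, so the blind copy moves 16 bytes):

#include <string.h>

/* blast 16 bytes, advance by the true count (buffers need 16 bytes of slack) */
#define COPY16(d, s, n) { memcpy((d), (s), 16); (d) += (n); }

unsigned int w0 = (unsigned int)( x        & 0xFFFF);
unsigned int w1 = (unsigned int)((x >> 16) & 0xFFFF);
unsigned int w2 = (unsigned int)((x >> 32) & 0xFFFF);
unsigned int w3 = (unsigned int)((x >> 48) & 0xFFFF);
COPY16(dst, tab16_0 + tabofs16[w0], n16[w0]);
COPY16(dst, tab16_1 + tabofs16[w1], n16[w1]);
COPY16(dst, tab16_2 + tabofs16[w2], n16[w2]);
COPY16(dst, tab16_3 + tabofs16[w3], n16[w3]);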

mvp
  • @OliCharlesworth: Haswell L1 cache is [64KB per core](http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29) - it might be just enough to actually work well. The only way to tell is to test it. – mvp Dec 21 '13 at 07:22
  • 1
    There's no need to stick to byte-sized keys. A 8192 entry table will work too. (13 bits, 5*13 > 64) – MSalters Dec 21 '13 at 13:05
  • @mvp: That's split into two 32k chunks, one for data, one for instructions. So for random data, you'd be missing L1 half the time on average. But I agree, that might still be faster overall than a 256-entry table. – Oliver Charlesworth Dec 21 '13 at 14:37
1

Assuming the set bits are sparse:

int count = 0;
uint64_t tmp_bitmap = x;
while (tmp_bitmap) {
    int next_psn = __builtin_ffsll(tmp_bitmap) - 1; // 0-based position of lowest set bit
    tmp_bitmap &= (tmp_bitmap - 1);                 // clear lowest set bit
    idx[count++] = next_psn;
}
boolAeon
  • 2
    Can you explain why this is THE fastest way to solve the original question, and what other things you tried? – mjuarez May 30 '15 at 06:47
It is the fastest way *if* the set bits are sparse, because it uses a GCC builtin to find the next set bit, quickly skipping the zero bits. – Marcelo Pacheco Oct 26 '20 at 14:47
0

The question is what you are going to do with the collection of positions.
If you have to iterate over it many times, then yes, it might be interesting to gather the positions once as you are doing now and iterate many times.
But if it's for iterating just once or a few times, then you might skip the intermediate array of positions entirely and just invoke a processing closure/function at each encountered 1 while iterating over the bits.

Here is a naive example of a bit iterator I wrote in Smalltalk:

LargePositiveInteger>>bitsDo: aBlock
| mask offset |
1 to: self digitLength do: [:iByte |
    offset := (iByte - 1) << 3.
    mask := (self digitAt: iByte).
    [mask = 0]
        whileFalse:
            [aBlock value: mask lowBit + offset.
            mask := mask bitAnd: mask - 1]]

A LargePositiveInteger is an Integer of arbitrary length composed of byte digits. lowBit answers the rank of the lowest set bit and is implemented as a lookup table with 256 entries.

In C++11 you can easily pass a closure, so it should be easy to translate.

uint64_t x;
unsigned int mask;
void (*process_bit_position)(unsigned int);
unsigned char offset = 0;
unsigned char lowBitTable[16] = {0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0}; // 0-based, first entry is unused
while( x )
{
    mask = x & 0xFUL;
    while (mask)
    {
        process_bit_position( lowBitTable[mask]+offset );
        mask &= mask - 1;
    }
    offset += 4;
    x >>= 4;
}

The example is demonstrated with a 4-bit table, but you can easily extend it to 13 bits or more if that fits in cache.

For branch prediction, the inner loop could be rewritten as a for(i=0; i<nbit; i++) with an additional table lookup nbit = numBitTable[mask], then unrolled with a switch (could the compiler do it?), but I'll let you measure how it performs first... A sketch follows.
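The counted-loop variant might look like this (my sketch; numBitTable is a hypothetical popcount table for the 4-bit demo above):

static const unsigned char numBitTable[16] =
    {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};  /* popcount of each 4-bit value */

int nbit = numBitTable[mask];
for (int i = 0; i < nbit; i++)          /* trip count known up front */
{
    process_bit_position( lowBitTable[mask] + offset );
    mask &= mask - 1;                   /* clear lowest set bit */
}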

aka.nice
0

Has this been found to be too slow?
Small and crude, but it's all in the cache and CPU registers:

void mybits(uint64_t x, unsigned char *idx)
{
  unsigned char n = 0;
  do {
    if (x & 1) *(idx++) = n;
    n++;
  } while (x >>= 1);          // If x is signed this will never end
  *idx = (unsigned char) 255; // List Terminator
}

It's still 3 times faster to unroll the loop and produce an array of 64 true/false values (which isn't quite what's wanted)

void mybits_3_2(uint64_t x, idx_type idx[])
{
#define SET(i) (idx[i] = (x >> (i)) & 1)  // 1 if bit i is set, 0 otherwise
  SET( 0);
  SET( 1);
  SET( 2);
  SET( 3);
  ...
  SET(63);
}
9mjb
0

Here's some tight code, written for one byte (8 bits), but it should expand easily and obviously to 64 bits.

#include <stdio.h>

int main(void)
{
    int x = 187;

    int ans[8] = {-1,-1,-1,-1,-1,-1,-1,-1};
    int idx = 0;

    while (x)
    {
        switch (x & ~(x-1))
        {
        case 0x01: ans[idx++] = 0; break;
        case 0x02: ans[idx++] = 1; break;
        case 0x04: ans[idx++] = 2; break;
        case 0x08: ans[idx++] = 3; break;
        case 0x10: ans[idx++] = 4; break;
        case 0x20: ans[idx++] = 5; break;
        case 0x40: ans[idx++] = 6; break;
        case 0x80: ans[idx++] = 7; break;
        }

        x &= x-1;
    }

    getchar();
    return 0;
}

Output array should be:

ans = {0,1,3,4,5,7,-1,-1};
abelenky
0

If I take "I need a fast way to get the position of all one bits in a 64-bit integer" literally...

I realise this is a few weeks old, but out of curiosity: I remember, way back in my assembly days on the CBM64 and Amiga, using an arithmetic shift and then examining the carry flag. If it's set, the shifted-out bit was 1; if clear, it was 0.

e.g. for an arithmetic shift left (examining from bit 63 down to bit 0):

pseudo code (ignore instruction mix errors and oversimplification... it's been a while):

    move #64+1, counter
    loop. ASL 64bitinteger       
    BCS carryset
    decctr. dec counter
    bne loop
    exit

    carryset. 
    //store #counter-1 (i.e. bit position) in datastruct indexed by counter
    jmp decctr

...I hope you get the idea.

I've not used assembly since then, but I'm wondering if we could use some C++ inline assembly similar to the above to do something similar here. We could do the whole conversion in assembly (very few lines of code), building up an appropriate data structure. C++ could simply examine the answer.

If this is possible then I'd imagine it to be pretty fast.
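For what it's worth, here is a C rendering of the same shift-and-test-carry idea (my sketch; record() is a hypothetical sink for the positions). The carry test becomes a test of the top bit before shifting left:

#include <stdint.h>

void positions_msb_first(uint64_t x) {
    for (int pos = 63; x != 0; pos--) {
        if (x & 0x8000000000000000ULL)  /* the bit that would enter the carry */
            record(pos);                /* hypothetical: consume the position */
        x <<= 1;
    }
}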

Jim
0

A simple solution, but perhaps not the fastest, depending on the speed of the log and pow functions:

#include <math.h>
#include <stdio.h>

void getSetBits(unsigned long num){

    int bit;
    while(num){
        bit = log2(num);
        num -= pow(2, bit);

        printf("%i\n", bit); // use bit number
    }
}

Complexity is O(D), where D is the number of set bits.
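If the floating-point calls turn out to be the bottleneck (they also lose precision for bits above 52), the same highest-bit-first loop can be done with a GCC builtin and a shift; a hedged alternative sketch:

#include <stdio.h>

void getSetBits2(unsigned long long num) {
    while (num) {
        int bit = 63 - __builtin_clzll(num);  /* index of highest set bit */
        num &= ~(1ULL << bit);                /* clear it */
        printf("%i\n", bit);                  /* use bit number */
    }
}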