
In counting the number of set bits in a word, a brute-force approach would be something like this:

int CountNumSetBits(unsigned long n)
{
    unsigned short num_setbits = 0;
    while (n)
    {
        num_setbits += n & 1; // add the lowest bit (0 or 1)
        n >>= 1;              // shift the next bit into position
    }
    return num_setbits;
}

The big-O complexity would be O(n), where n is the number of bits in the word.

I thought of another way of writing the algorithm, taking advantage of the fact that we can obtain the first occurrence of a set bit (the lowest set bit) using y = x & ~(x - 1):

int CountNumSetBitsMethod2(unsigned long n)
{
    unsigned short num_setbits = 0;
    int y = 0;
    while (n)
    {
        y = n & ~(n - 1); // get first occurrence of '1'
        if (y) // if we have a set bit inc our counter
            ++num_setbits;
        n ^= y; // erase the first occurrence of '1'
    }
    return num_setbits;
}

If we assume that our inputs are 50% ones and 50% zeros, it appears that the second algorithm could be twice as fast. However, the amount of work done per iteration is greater:

In method one we do the following for each bit: 1 AND, 1 add, and 1 shift.

In method two we do the following for each set bit: 1 AND, 1 complement, 1 subtraction (the result of the subtraction has to be copied to another register), 1 compare, 1 increment (if the compare is true), and 1 XOR.

Now, in practice one can determine which algorithm is faster by profiling: that is, using a stopwatch mechanism and some test data, and calling each algorithm, say, a million times.
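
As a rough illustration of that kind of stopwatch benchmark, here is a minimal sketch using the standard clock() function from <time.h>; the iteration count, the test value, and the volatile sink (used to keep the compiler from optimizing the calls away) are illustrative choices, not part of the question:

#include <stdio.h>
#include <time.h>

int CountNumSetBits(unsigned long n);        /* method 1 above */
int CountNumSetBitsMethod2(unsigned long n); /* method 2 above */

int main(void)
{
    volatile unsigned long sink = 0;          /* keeps the calls from being optimized out */
    const unsigned long test = 0x55555555UL;  /* alternating bits: 50% ones, 50% zeros */
    clock_t start;

    start = clock();
    for (long i = 0; i < 1000000L; i++)
        sink += CountNumSetBits(test);
    printf("method 1: %.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    start = clock();
    for (long i = 0; i < 1000000L; i++)
        sink += CountNumSetBitsMethod2(test);
    printf("method 2: %.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    return 0;
}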

What I want to do first, however, is see how well I can estimate the speed difference by eyeballing the code (given the same number of set and unset bits).

If we assume that the subtraction takes approximately the same number of cycles as the add, and that all the other operations are equal cycle-wise, can one conclude that each algorithm takes about the same amount of time?

Note: I am assuming here we cannot use lookup tables.

Dave S

2 Answers


The second algorithm can be greatly simplified:

int CountNumSetBitsMethod2(unsigned long n) {
    unsigned short num_setbits = 0;
    while (n) {
        num_setbits++;
        n &= n - 1; // clear the lowest set bit
    }
    return num_setbits;
}
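
Since each iteration of n &= n - 1 clears the lowest set bit, the loop body runs exactly once per set bit and needs only a subtraction, an AND, and an increment, eliminating the compare and the XOR of the original.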

There are many more ways to compute the number of bits set in a word:

  • Using lookup tables for multiple bits at a time
  • Using 64-bit multiplications
  • Using parallel addition
  • Using extra tricks to shave a few cycles.

Trying to determine empirically which is faster by counting cycles is not so easy because even looking at the assembly output, it is difficult to assess the impact of instruction parallelisation, pipelining, branch prediction, register renaming and contention... Modern CPUs are very sophisticated! Furthermore, the actual code generated depends on the compiler version and configuration and the timings depend on the CPU type and release... Not to mention the variability linked to the particular sets of values used (for algorithms with variable numbers of instructions).

Benchmarking is a necessary tool, but even careful benchmarking may fail to model the actual usage correctly.

Here is a great site for this kind of bit twiddling games:

http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive

I suggest you implement the different versions and perform comparative benchmarks on your system. There is no definite answer, only local optima for specific sets of conditions.

Some amazing finds:

// option 3, for at most 32-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
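
(The multiplication by 0x1001001001001 makes five shifted copies of each 12-bit group, the mask 0x84210842108421 keeps one bit per position, spaced five bits apart, and the % 0x1f adds those bits together, since powers of 32 are congruent to 1 modulo 31.)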

A more classic one, usually considered the best method for counting bits in a 32-bit integer v:

v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
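
For reference, here is that snippet wrapped into a self-contained function; the name PopCount32 and the uint32_t parameter type are my additions, under the same assumption of a 32-bit value:

#include <stdint.h>

int PopCount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555);                /* 16 sums of 2 bits each */
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); /* 8 sums of 4 bits each */
    /* fold to 4 byte-wide sums, then add the four bytes via the multiply */
    return (int)((((v + (v >> 4)) & 0x0F0F0F0FU) * 0x01010101U) >> 24);
}
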
chqrlie

First, the only way to know how fast things are is to measure them.

Second, to find the number of set bits in some bytes, build a lookup table for the number of set bits in a byte:

0->0
1->1
2->1
3->2
4->1

etc.

This is a common method and very fast.

You can code it by hand or create it at startup.
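
A minimal sketch of that idea, assuming the table is built once at startup (the table and function names here are illustrative):

static unsigned char bits_in_byte[256]; /* bits_in_byte[b] = number of set bits in byte b */

void InitBitTable(void)
{
    for (int b = 1; b < 256; b++) /* bits_in_byte[0] is already 0 */
        bits_in_byte[b] = bits_in_byte[b >> 1] + (b & 1);
}

int CountNumSetBitsLookup(unsigned long n)
{
    int count = 0;
    while (n) /* consume the word one byte at a time */
    {
        count += bits_in_byte[n & 0xFF];
        n >>= 8;
    }
    return count;
}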

pm100
  • Please look at the last line in my question. Yes I know that lookup tables can be faster, but I purposely avoided them for this question. – Dave S Feb 08 '17 at 21:46
  • @DaveS - The optimal solution is different for various number of bits expected to be set. A table, going through all bytes, might be good if you have -say- 28 bits set on average. If you usually only have one bit, or three, chqrlie's first code will be much faster (and not use any cache space or the lookup table). – Bo Persson Feb 08 '17 at 22:12
  • don't see how you can have 28 bits set in a byte. Max gotta be 8. A 256-byte lookup table doesn't take a lot of space or cache lines – pm100 Feb 08 '17 at 22:31
  • @pm100 - He is counting the bits in an `unsigned long`, so he would have to loop over the bytes (or have the world's largest lookup table :-). Looking up the numbers byte by byte would need 8 (or 4?) rounds in a loop, while removing one bit at a time could get away with 3 rounds for 3 bits. – Bo Persson Feb 08 '17 at 22:43
  • it would be interesting to do a timing comparison :-) – pm100 Feb 08 '17 at 22:47