What is unoptimized about this code?

Question

I wrote a solution for a question on interviewstreet, here is the problem description:

https://www.interviewstreet.com/challenges/dashboard/#problem/4e91289c38bfd

Here is the solution they have given:

https://gist.github.com/1285119

Here is the solution that I coded:

#include<iostream>
#include <string.h>
using namespace std;
#define LOOKUPTABLESIZE 10000000
int popCount[2*LOOKUPTABLESIZE];
int main()
{
int numberOfTests = 0;
cin >> numberOfTests;

for(int test = 0;test<numberOfTests;test++)
{
    int startingNumber = 0;
    int endingNumber = 0;
    cin >> startingNumber >> endingNumber;

    int numberOf1s = 0;


    for(int number=startingNumber;number<=endingNumber;number++)
    {
        if(number >-LOOKUPTABLESIZE && number < LOOKUPTABLESIZE)
        {
            if(popCount[number+LOOKUPTABLESIZE] != 0)
            {
                numberOf1s += popCount[number+LOOKUPTABLESIZE];
            }
            else
            {
                popCount[number+LOOKUPTABLESIZE] =__builtin_popcount (number);
                numberOf1s += popCount[number+LOOKUPTABLESIZE];
            }
        }
        else
        {
        numberOf1s += __builtin_popcount (number);
        }
    }
    cout << numberOf1s << endl;

}

}

Can you please point me what is wrong with my code? It can only pass 3/10 of tests. The time limit is 3 seconds.

1. `string.h` instead of `cstring`. 2. `abusing namespace std`. 3. `#define` rather than `static const`. 4. return value of `cin >> numberOfTests` ignored. Now I'm bored. No job, I'm afraid. — Kerrek SB, Nov 11 '12 at 13:54
@john: http://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Other-Builtins.html: "Returns the number of 1-bits in x." — Antti, Nov 11 '12 at 13:59
OK, so basically the code is caching the return of builtin_popcount. Is that good idea? — john, Nov 11 '12 at 14:01
It is a more generic question than this,but in here it is a good idea since I know the input type. It may not be so with different input characteristics. — EralpB, Nov 11 '12 at 14:19

score 5 · Accepted Answer · edited May 23 '17 at 12:12

5

What is unoptimized about this code?

The algorithm. You are looping

for(int number=startingNumber;number<=endingNumber;number++)

computing or looking up the number of 1-bits in each. That can take a while.

A good algorithm counts the number of 1-bits in all numbers 0 <= k < n in O(log n) time using a bit of math.

Here is an implementation counting 0s in decimal expansions, the modification to make it count 1-bits shouldn't be hard.

edited May 23 '17 at 12:12

Community

1
1

answered Nov 11 '12 at 13:58

Daniel Fischer

181,706
17
308
431

Thank you, i didn't think such an algorithm exists. I'll try to find and understand it then. – EralpB Nov 11 '12 at 14:01
@EralpB: In general, brute force approaches do not work in interview questions. They appeal to both maths, data structure and algorithms knowledge, as well as your ability to string them together, discern patterns in the chaos, and reuse past experiences. – Matthieu M. Nov 11 '12 at 15:56
I tried my hand at solving the problem completely, however it's unclear if the approach I went with is similar to your counting 0s implementations. It seems to me that it is. Would you mind reviewing it ? – Matthieu M. Nov 11 '12 at 19:55
It's the same principle, @Matthieu. One thing you have not treated is that in the original problem the start and end of the range can be negative (thus you need a few checks up front to test for that, and compute the count of 1-bits in `[-N, -1]` from the count of 1-bits in `[0, N-1]` using the assumption of two's complement representation). One problem in your solution is the use of `log(double(n - 1))/Log2`, that will be wrong for large `n` close to powers of 2, and could be wrong for some smaller `n` already. Nevertheless, a good one, +1. – Daniel Fischer Nov 11 '12 at 20:13
@DanielFischer: oops, I must admit I did not really checked the interview question exactly ;) Regarding `log`, I suppose your worry is about precision. I hesitated going with `long double` but even then I am not sure it would be enough. There are bit twiddling hacks to get it though, which is why I abstracted in a function. I am not too worried about it thus. Thanks for the review. – Matthieu M. Nov 11 '12 at 20:19
@MatthieuM. You can use the `double` result as a starting point, it will never be more than 1 off, and then check whether you need to shift one left or right, or it's already correct. Or you could use `__builtin_clzl` or something like that if you're trying to squeeze out every bit of speed. – Daniel Fischer Nov 11 '12 at 20:26
@DanielFischer: Oh, did not know that builtin, it's exactly what I would need indeed. – Matthieu M. Nov 12 '12 at 08:54

Matthieu M. · Answer 2 · 2012-11-12T08:58:49.467

When looking at such a question, you need to break it down in simple pieces.

For example, suppose that you know how many 1s there are in all numbers [0, N] (let's call this ones(N)), then we have:

size_t ones(size_t N) { /* magic ! */ }

size_t count(size_t A, size_t B) {
    return ones(B) - (A ? ones(A - 1) : 0);
}

This approach has the advantage that one is probably simpler to program that count, for example using recursion. As such, a first naive attempt would be:

// Naive
size_t naive_ones(size_t N) {
    if (N == 0) { return 0; }
    return __builtin_popcount(N) + naive_ones(N-1);
}

But this is likely to be too slow. Even when simply computing the value of count(B, A) we will be computing naive_ones(A-1) twice!

Fortunately, there is always memoization to assist here, and the transformation is quite trivial:

size_t memo_ones(size_t N) {
    static std::deque<size_t> Memo(1, 0);
    for (size_t i = Memo.size(); i <= N; ++i) {
        Memo.push_back(Memo[i-1] + __builtin_popcnt(i));
    }
    return Memo[N];
}

It's likely that this helps, however the cost in terms of memory might be... crippling. Ugh. Imagine that for computing ones(1,000,000) we will occupy 8MB of memory on a 64bits computer! A sparser memoization could help (for example, only memoizing every 8th or 16th count):

// count number of ones in (A, B]
static unoptimized_count(size_t A, size_t B) {
    size_t result = 0;
    for (size_t i = A + 1; i <= B; ++i) {
        result += __builtin_popcount(i);
    }
    return result;
}

// something like this... be wary it's not tested.
size_t memo16_ones(size_t N) {
    static std::vector<size_t> Memo(1, 0);
    size_t const n16 = N  - (N % 16);
    for (size_t i = Memo.size(); i*16 <= n16; ++i) {
        Memo.push_back(Memo[i-1] + unoptimized_count(16*(i-1), 16*i);
    }
    return Memo[n16/16] + unoptimized_count(n16, N);
}

However, while it does reduce the memory cost, it does not solve the main speed issue: we must at least use __builtin_popcount B times! And for large values of B this is a killer.

The above solutions are mechanical, they did not require one ounce of thought. It turns out that interviews are not so much about writing code than they are about thinking.

Can we solve this problem more efficiently than dumbly enumerating all integers until B ?

Let's see what our brains (quite the amazing pattern machine) picks up when considering the first few entries:

N    bin  1s   ones(N)
0    0000 0    0
1    0001 1    1
2    0010 1    2
3    0011 2    4
4    0100 1    5
5    0101 2    7
6    0110 2    9
7    0111 3    12
8    1000 1    13
9    1001 2    15
10   1010 2    17
11   1011 3    20
12   1100 2    22
13   1101 3    25
14   1110 3    28
15   1111 3    32

Notice a pattern ? I do ;) The range 8-15 is built exactly like 0-7 but with one more 1 per line => it's like a transposition. And it's quite logical too, isn't it ?

Therefore, ones(15) - ones(7) = 8 + ones(7), ones(7) - ones(3) = 4 + ones(3) and ones(1) - ones(0) = 1 + ones(0).

Well, let's make this a formula:

Reminder: ones(N) = popcount(N) + ones(N-1) (almost) by definition
We now know that ones(2**n - 1) - ones(2**(n-1) - 1) = 2**(n-1) + ones(2**(n-1) - 1)

Let's make isolate ones(2**n), it's easier to deal with, note that popcount(2**n) = 1:

regroup: ones(2**n - 1) = 2**(n-1) + 2*ones(2**(n-1) - 1)
use the definition: ones(2**n) - 1 = 2**(n-1) + 2*ones(2**(n-1)) - 2
simplify: ones(2**n) = 2**(n-1) - 1 + 2*ones(2**(n-1)), with ones(1) = 1.

Quick sanity check:

1  = 2**0 => 1  (bottom)
2  = 2**1 => 2  = 2**0 - 1 + 2 * ones(1)
4  = 2**2 => 5  = 2**1 - 1 + 2 * ones(2)
8  = 2**3 => 13 = 2**2 - 1 + 2 * ones(4)
16 = 2**4 => 33 = 2**3 - 1 + 2 * ones(8)

Looks like it works!

We are not quite done though. A and B might not necessarily be powers of 2, and if we have to count all the way from 2**n to 2**n + 2**(n-1) that's still O(N)!

On the other hand, if we manage to express a number in base 2, then we should be able to leverage our newly acquired formula. The main advantage being than there are only log2(N) bits in the representation.

Let's pick an example and understand how it works: 13 = 8 + 4 + 1

... however, the count is not just merely the sum:

ones(13) != ones(8) + ones(4) + ones(1)

Let's express it in terms of the "transposition" strategy instead:

ones(13) - ones(8) = ones(5) + (13 - 8)

ones(5) - ones(4) = ones(1) + (5 - 4)

Okay, easy to do with a bit of recursion.

#include <cmath>
#include <iostream>

static double const Log2 = log(2);

// store ones(2**n) at P2Count[n]
static size_t P2Count[64] = {};

// Unfortunately, the conversion to double might lose some precision
// static size_t log2(size_t n) { return log(double(n - 1))/Log2 + 1; }

// __builtin_clz* returns the number of leading 0s
static size_t log2(size_t n) {
    if (n == 0) { return 0; }
    return sizeof(n) - __builtin_clzl(n) - 1;
}

static size_t ones(size_t n) {
    if (n == 0) { return 0; }
    if (n == 1) { return 1; }

    size_t const lg2 = log2(n);
    size_t const np2 = 1ul << lg2; // "next" power of 2

    if (np2 == n) { return P2Count[lg2]; }

    size_t const pp2 = np2 / 2; // "previous" power of 2

    return ones(pp2) + ones(n - pp2) + (n - pp2);
} // ones

// reminder: ones(2**n) = 2**(n-1) - 1 + 2*ones(2**(n-1))
void initP2Count() {
    P2Count[0] = 1;

    for (size_t i = 1; i != 64; ++i) {
        P2Count[i] = (1ul << (i-1)) - 1 + 2 * P2Count[i-1];
    }
} // initP2Count

size_t count(size_t const A, size_t const B) {
    if (A == 0) { return ones(B); }

    return ones(B) - ones(A - 1);
} // count

And a demonstration:

int main() {
    // Init table
    initP2Count();
    std::cout << "0: " << P2Count[0] << ", 1: " << P2Count[1] << ", 2: " << P2Count[2] << ", 3: " << P2Count[3] << "\n";

    for (size_t i = 0; i != 16; ++i) {
        std::cout << i << ": " << ones(i) << "\n";
    }

    std::cout << "count(7, 14): " << count(7, 14) << "\n";
}

Victory!

Note: as Daniel Fisher noted, this fails to account for negative number (but assuming two-complement it can be inferred from their positive count).

What is unoptimized about this code?

2 Answers2