Create a sequence which is ordered by bits set

Question

I'm looking for a reversible function unsigned f(unsigned) for which the number of bits set in f(i) increases with i, or at least does not decrease. Obviously, f(0) has to be 0 then, and f(~0) must come last. In between there's more flexibility. After f(0), the next 32* values must be 1U<<0 to 1U<<31, but I don't care a lot about the order (they all have 1 bit set).

I'd like an algorithm which doesn't need to calculate f(0)...f(i-1) in order to calculate f(i), and a complete table is also unworkable.

This is similar to Gray codes, but I can't see a way to reuse that algorithm. I'm trying to use this to label a large data set, and prioritize the order in which I search them. The idea is that I have a key C, and I'll check labels C ^ f(i). Low values of i should give me labels similar to C, i.e. differing in only a few bits.

[*] Bonus points for not assuming that unsigned has 32 bits.

[example] A valid initial sequence:

0, 1, 2, 4, 16, 8 ... // 16 and 8 both have one bit set, so they compare equal

An invalid initial sequence:

0, 1, 2, 3, 4 ... // 3 has two bits set, so it cannot precede 4 or 2147483648.

At first I thought you were asking for an implementation of `f(i)` which returns the number of bits set in `i`. But now I don't think so. Can you please give some example input and output to clarify what you're looking for? — John Dibling, Oct 10 '13 at 13:21
I think the questions aims at finding a total order of binary numbers, so that every number will less bits set preceeds a number with more bits set. Numbers with the same number of bits set can have arbitrary but deterministic order. — dornhege, Oct 10 '13 at 13:23
@dornhege: Indeed, the order has to be deterministic for an inverse function to exist. — MSalters, Oct 10 '13 at 13:32
Is the real aim for things that differ `in only a few bits`? e.g. like the hamming distance? — doctorlove, Oct 10 '13 at 13:36
@John Dibling Yeah I just saw the edited question. Build a sequence like this won't be that easy. — Gabriel L., Oct 10 '13 at 13:48
http://stackoverflow.com/questions/13542794/hamming-weight-based-indexing This one could be handy. — Danstahr, Oct 10 '13 at 13:49
@Danstahr: Definitely an overlap, and deep down the reasons are similar. We're both looking at a popcount of the XOR difference of two labels=addresses. — MSalters, Oct 10 '13 at 13:56
@doctorlove: Yes, but it doesn't make a difference whether you look for numbers that differ in a few bits from 0 or for numbers that differ in a few bits from some constant `C`. The two sequences are trivially transformed into each other by XOR'ing every value with C. — MSalters, Oct 10 '13 at 13:57
@MSalters: We just need to find an inverse function for the second code snippet. Which might be quite similar or at least it can inspire us. Unfortunately I don't have enough time to take a deeper look at the moment. — Danstahr, Oct 10 '13 at 14:03
@hasan: "the number of bits _which are_ set in ". In English you can often leave out such a pair of words: "the people who are walking there ..." -> "the people walking there ..." — MSalters, Oct 10 '13 at 14:52
:) I have no idea what are you talking about. can you explain with a real numbers. lets say 8, 16 and a number of your choice you bigger number of sets. — hasan, Oct 10 '13 at 14:56
Are you sure you're not OK with generating the sequence incrementally? That would be fairly easy. — harold, Oct 10 '13 at 15:04

score 3 · Accepted Answer · answered Oct 11 '13 at 09:16

Ok, seems like I have a reasonable answer. First let's define binom(n,k) as the number of ways in which we can set k out of n bits. That's the classic Pascal triangle:

1  1
1  2  1
1  3  3  1
1  4  6  4  1
1  5 10 10  5  1
1  6 15 20 15  6  1
1  7 21 35 35 21  7  1
1  8 28 56 70 56 28  8  1
...

Easily calculated and cached. Note that the sum of each line is 1<<lineNumber.

The next thing we'll need is the partial_sum of that triangle:

1  2
1  3  4
1  4  7  8
1  5 11 15 16
1  6 16 26 31  32
1  7 22 42 57  63  64
1  8 29 64 99  120 127 128
1  9 37 93 163 219 247 255 256 
...

Again, this table can be created by summing two values from the previous line, except that the new entry on each line is now 1<<line instead of 1.

Let's use these tables above to construct f(x) for an 8 bits number (it trivially generalizes to any number of bits). f(0) still has to be 0. Looking up the 8th row in the first triangle, we see that next 8 entries are f(1) to f(9), all with one bit set. The next 28 entries (7+6+5+4+3+2+1) all have 2 bits set, so that's f(10) to f(37). The next 56 entries, f(38) to f(93) have 3 bits, and there are 70 entries with 4 bits set. From symmetry we can see that they're centered around f(128), in particular they're f(94) to f(163). And obviously, the only number with 8 bits set sorts last, as f(255).

So, with these tables we can quickly determine how many bits must be set in f(i). Just do a binary search in the last row of your table. But that doesn't answer exactly which bits are set. For that we need the previous rows.

The reason that each value in the table can be created from the previous line is simple. binom(n,k) == binom(k, n-1) + binom(k-1, n-1). There are two sorts of numbers with k bits set: Those that start with a 0... and numbers which start with 1.... In the first case, the next n-1 bits must contain those k bits, in the second case the next n-1 bits must contain only k-1 bits. Special cases are of course 0 out of n and n out of n.

This same stucture can be used to quickly tell us what f(16) must be. We already had established that it must contain 2 bits set, as it falls in the range f(10) - f(37). In particular, it's number 6 with 2 bits set (starting as usual with 0). It's useful to define this as an offset in a range as we'll try to shrink the length this range from 28 down to 1.

We now subdivide that range into 21 values which start with a zero and 7 which start a one. Since 6 < 21, we know that the first digit is a zero. Of the remaining 7 bits, still 2 need to be set, so we move up a line in the triangle and see that 15 values start with two zeroes, and 6 start with 01. Since 6 < 15, f(16) starts with 00. Going further up, 7 <= 10 so it starts with 000. But 6 == 6, so it doesn't start with 0000 but 0001. At this point we change the start of the range, so the new offset becomes 0 (6-6)

We know need can focus only on the numbers that start with 0001 and have one extra bit, which are f(16)...f(19). It should be obvious by know that the range is f(16)=00010001, f(17)=00010010, f(18)=00010100, f(19)=00011000.

So, to calculate each bit, we move one row up in the triangle, compare our "remainder", add a zero or one based on the comparison possibly go left one column. That means the computational complexity of f(x) is O(bits), or O(log N), and the storage needed is O(bits*bits).

And [the inverse function](http://stackoverflow.com/a/13614164/15416), so this is efficiently reversible. — MSalters, Oct 11 '13 at 09:28

Nobody moving away from SE · Answer 2 · 2013-10-10T16:50:43.880

For each given number k we know that there are binom(n, k) n-bit integers that have exactly k bits of value one. We can now generate a lookup table of n + 1 integers that store for each k how many numbers have less one bits. This lookup table can then be used to find the number o of one bits of f(i).

Once we know this number we subtract the lookup table value for this number of bits from i which leaves us with the permutation index p for numbers with the given number of one bits. Altough I have not done research in this area I am quite sure that there exists a method for finding the pth permutation of a std::vector<bool> which is initialized with zeros and o ones in the lowest bits.

The reverse function

Again the lookup table comes in handy. We can directly calculate the number of preceding numbers with less one bits by counting the one bits in the input integer and reading in the lookup table. Then you "only" need to determine the permutation index and add it to the looked up value and you are done.

Disclaimer

Of course this is only a rough outline and some parts (especially involving the permutations) might take longer than it sounds.

Addition

You stated yourself

I'm trying to use this to label a large data set, and prioritize the order in which I search them.

Which sounds to me as if you would be going from the low hamming distance to the high hamming distance. In this case it would be enough to have an incremental version which generates the next number from the previous:

unsigned next(unsigned previous)
{
    if(std::next_permutation(previous))
        return previous;
    else
        return (1 << (1 + countOneBits(previous))) - 1;
}

Of course std::next_permutation permutation does not work this way but I think it is clear how I mean to use it.

And of those `binom(n,k)` permutations, `binom(n-1, k)` have the first bit zero, and `binom(n-1, k-1)` have the first bit set. Sounds like I'd need to walk Pascal's triangle. — MSalters, Oct 10 '13 at 14:55
`Sounds like I'd need to walk Pascal's triangle.` When? To get the permutation? — Nobody moving away from SE, Oct 10 '13 at 14:56
@MSalters: Regarding the size of the subsets were the first bit is zero/one: These calculations only work for `k = 1..n-1` but the other two cases (k = 0/n) are of course trivial. — Nobody moving away from SE, Oct 10 '13 at 15:09

Create a sequence which is ordered by bits set

2 Answers2

The reverse function

Disclaimer

Addition