7

Consider the following algorithm from the C++ standard library, std::shuffle, which has the following signature:

template <class RandomIt, class URBG>
void shuffle(RandomIt first, RandomIt last, URBG&& g);

It reorders the elements in the given range [first, last) such that each possible permutation of those elements has equal probability of appearance.


I am trying to implement the same algorithm, but one that works at the bit level, randomly shuffling the bits of the words of the input sequence. Considering a sequence of 64-bit words, I am trying to implement:

template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g)

Question: How to do that as efficiently as possible (using compiler intrinsics if necessary)? I am not necessarily looking for an entire implementation, but more for suggestions/directions of research, because it's really not clear to me if it's even feasible to implement that efficiently.

Vincent
  • 7
    Rather than a shuffle, I might see this as generating a random sequence of bits, packed as a uint64 array, where the number of 1 and 0 bits equal that of the input. -- I am not say that this necessarily makes the task easier, but I think it might. – 500 - Internal Server Error Aug 01 '19 at 20:05
  • @500-InternalServerError Yes, this is what I first thought... But I couldn't find a way to make that efficient (yet, at least)... – Vincent Aug 01 '19 at 20:09
  • 1
    what about using `std::bitset` with 500's idea? I dont know about performance, but I love `std::bitset` ;) – 463035818_is_not_an_ai Aug 01 '19 at 20:13
  • Possibly related? https://stackoverflow.com/q/17010857/1896169 It might help – Justin Aug 01 '19 at 20:18
  • Can you shuffle the bytes and then do a random rotate with carry on each byte? You might have to use assembler to do the rotate with carry though. – Demolishun Aug 01 '19 at 20:18
  • 3
    @Demolishun No, that would prefer some permutations to others (if I understand you correctly). E.g. if you had two words: one with all bits set and one with none, you'd only get two of the possible permutations – Justin Aug 01 '19 at 20:20
  • you want to shuffle bits in the individual `uint32_t`s or along the whole sequence? – 463035818_is_not_an_ai Aug 01 '19 at 20:39
  • Section 6.13 of 'Elements of Programming Interviews' by Adnan Aziz, Tsung-Hsien Lee, Amit Prakash (2015), they treat this problem. They provide an O(n)-time, O(1)-space solution, it seems. – embeddedPy Aug 01 '19 at 20:54
  • @formerlyknownas_463035818 Along the whole sequence – Vincent Aug 01 '19 at 21:09
  • @embeddedPy Indeed, you could use a simple Fisher–Yates shuffle to shuffle the whole thing via bit iterators into each bit of the word, but that's likely to leave a lot of performance on the table. Moving single bits at a time is wasteful if there's a better algorithm available. – Justin Aug 01 '19 at 21:14
  • 1
    @Justin Yes, that is exactly what I'm trying to avoid. – Vincent Aug 01 '19 at 21:17
  • 1
    One point that may open up some more optimization opportunities is that you don't need every possible _permutation_, just every possible _combination_. Order doesn't matter, because one bit is indistinguishable from another. – Justin Aug 01 '19 at 21:19
  • @embeddedPy Section 6.13 is just a random shuffle. Manipulating every single bit one by one will be extremely slow... (even if the complexity is O(n)-time and O(1)-space) – Vincent Aug 01 '19 at 21:33
  • My current thought is: 1. count set bits (possibly with vector intrinsics), call this value `M`. 2. Use [Floyd's algorithm](https://stackoverflow.com/a/2394292/1896169) to select `M` bit-indices. Set those `M` indices to `1`, and all other indices to `0`. It may be better to select `Size - M` indices to set to `0` if `M > Size / 2` – Justin Aug 01 '19 at 21:46
  • Do you need a true uniform probability? And what is the number of 64-bit words? – m69's been on strike for years Aug 02 '19 at 02:13
  • 1
    Ideally I would like a true uniform probability. The number of 64 bit words can go from 1 to billions. – Vincent Aug 02 '19 at 13:34

2 Answers

5

It's obvious that asymptotically the running time is O(N), where N is the number of bits. Our goal is to improve the constant factors involved.

Disclaimer: the description of the proposed algorithm is a rough sketch. There are a lot of things that need to be added and, especially, a lot of details that need to be taken care of in order to make it work correctly. The approximate execution time will not differ from what is claimed here, though.


Baseline Algorithm

The most obvious approach is the textbook one, which takes N swap operations, each of which calls the random generator (taking R milliseconds), reads the values of two different bits, and writes new values to both, for a total of 4 * A milliseconds (A is the time to read or write one bit). Suppose each array lookup takes C milliseconds. The total time of this algorithm is then N * (R + 4 * A + 2 * C) milliseconds (approximately). It is also reasonable to assume that random number generation dominates, i.e. R >> A == C.
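For reference, the baseline described above might look like the following sketch (the bit get/set helpers and the name `naive_bit_shuffle` are mine, not from the question):

```cpp
#include <cstdint>
#include <random>

// Textbook baseline (sketch): Fisher-Yates over individual bits.
// Every iteration costs one RNG call plus two bit reads and two bit writes.
template <class URBG>
void naive_bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g)
{
    if (first == last) return;
    const std::uint64_t n = 64 * static_cast<std::uint64_t>(last - first);

    auto get = [&](std::uint64_t i) -> bool {
        return (first[i / 64] >> (i % 64)) & 1u;
    };
    auto set = [&](std::uint64_t i, bool b) {
        if (b) first[i / 64] |=  (std::uint64_t{1} << (i % 64));
        else   first[i / 64] &= ~(std::uint64_t{1} << (i % 64));
    };

    for (std::uint64_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::uint64_t> dis(0, i);
        const std::uint64_t j = dis(g);
        const bool bi = get(i), bj = get(j);  // swap bits i and j
        set(i, bj);
        set(j, bi);
    }
}
```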


Proposed Algorithm

Suppose the bits are stored in a byte storage, i.e. we will work with blocks of bytes.

unsigned char bit_field[field_size = N / 8];

First, let's count the number of 1 bits in our bitset. For that, we can use a lookup table and iterate through the bitset as a byte array:

// Generate the lookup table; you may make it `constexpr`
// so that it is built at compile time.
int bitcount_lookup[256];
for (int i = 0; i < 256; ++i) {
  bitcount_lookup[i] = 0;
  for (int b = 0; b < 8; ++b)
    bitcount_lookup[i] += (i >> b) & 1;
}

We can treat this as preprocessing overhead (as it may as well be calculated at compile time) and say that it takes 0 milliseconds. Now, counting the number of 1 bits is easy (the following takes (N / 8) * C milliseconds):

int bitcount = 0;
for (auto *it = bit_field; it != bit_field + field_size; ++it)
  bitcount += bitcount_lookup[*it];
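As a faster alternative to the byte lookup table, counting can also be done on 64-bit words with a hardware popcount; a minimal sketch, assuming GCC/Clang (`__builtin_popcountll`; C++20 portably offers `std::popcount` in `<bit>`):

```cpp
#include <cstddef>
#include <cstdint>

// Counting step on 64-bit words instead of bytes (sketch).
// __builtin_popcountll is a GCC/Clang intrinsic; on x86 with -mpopcnt it
// compiles to a single POPCNT instruction per word.
std::size_t count_ones(const std::uint64_t* first, const std::uint64_t* last)
{
    std::size_t ones = 0;
    for (const std::uint64_t* p = first; p != last; ++p)
        ones += __builtin_popcountll(*p);
    return ones;
}
```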

Now, we randomly generate N / 8 numbers (call the resulting array gencnt[N / 8]), each in the range [0..8], such that they sum up to bitcount. This is a bit tricky and hard to do uniformly (the "correct" algorithm for generating a uniform distribution is quite slow compared to the baseline algorithm). A quite uniform-ish but quick solution is roughly:

  • Fill the gencnt[N / 8] array with the value v = bitcount / (N / 8).
  • Randomly choose N / 16 "black" cells. The rest are "white". The algorithm is similar to a random permutation, but of only half of the array.
  • Generate N / 16 random numbers in the range [0..v]. Let's call them tmp[N / 16].
  • Increase the "black" cells by the tmp[i] values, and decrease the "white" cells by them. This ensures that the overall sum is bitcount.

After that, we will have a uniform-ish random-ish array gencnt[N / 8], whose values are the number of 1 bits in each particular "cell". It was all generated in:

(N / 8) * C   +  (N / 16) * (4 * C)  +  (N / 16) * (R + 2 * C)
^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^
filling step      random coloring              filling

milliseconds (this estimation is done with a concrete implementation in mind). Lastly, we can have a lookup table of the bytes with a specified number of bits set to 1 (this can be precomputed overhead, or even built at compile time as constexpr, so let's assume it takes 0 milliseconds):

std::vector<std::vector<unsigned char>> random_lookup(9);  // 0..8 bits set
for (int b = 0; b < 256; b++)
  random_lookup[bitcount_lookup[b]].push_back(b);  // bucket byte `b` by its bit count

Then, we can fill our bit_field as follows (which takes roughly (N / 8) * (R + 3 * C) milliseconds):

for (int i = 0; i < field_size; i++) {
  const auto& bucket = random_lookup[gencnt[i]];
  bit_field[i] = bucket[rand() % bucket.size()];
}

Summing everything up, we have the total execution time:

T = (N / 8) * C +
    (N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C) + 
    (N / 8) * (R + 3 * C)

  = N * (C + (3/16) * R)  <  N * (R + 4 * A + 2 * C)
    ^^^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^^
     proposed algorithm        naive baseline algo

Although it's not truly uniformly random, it does spread the bits out quite evenly and randomly, and it's quite fast, so hopefully it gets the job done in your use case.

Chan Kha Vu
  • 1
    Counting the number of set bits has plenty of optimization opportunity. `std::accumulate`/`reduce` with `__builtin_popcount` is one easy way to implement it, or perhaps a [SIMD approach](https://stackoverflow.com/q/50081465/1896169) would work. – Justin Aug 01 '19 at 23:19
1

Observe that actually shuffling the bits, which involves swapping via Fisher-Yates, is not required to produce the exact equivalent: a random distribution of the bits.

#include <iostream>
#include <vector>
#include <random>

// shuffle a vector of bools. This requires only counting the number of trues in the vector
// followed by clearing the vector and inserting bool trues to produce an equivalent to
// a bit shuffle. This is cache line friendly and doesn't require swapping.
std::vector<bool> DistributeBitsRandomly(std::vector<bool> bvector)
{
    std::random_device rd;
    static std::mt19937 gen(rd());  //mersenne_twister_engine seeded with rd()

    // count the number of set bits and clear bvector
    int set_bits_count = 0;
    for (auto&& b : bvector)
        if (b)
        {
            set_bits_count++;
            b = false;
        }

    // set a bit if a random value in range bvector.size()-bit_loc-1 is
    // less than the number of bits remaining to be placed. This produces exactly the same
    // distribution as a random shuffle but only does an insertion of a 1 bit rather than
    // a swap. It requires counting the number of 1 bits. There are efficient ways
    // of doing this. See https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
    for (int bit_loc = 0; set_bits_count; bit_loc++)
    {
        std::uniform_int_distribution<int> dis(0, bvector.size()-bit_loc-1);
        auto x = dis(gen);
        if (x < set_bits_count)
        {
            bvector[bit_loc] = true;
            set_bits_count--;
        }
    }
    return bvector;
}

This performs the equivalent of shuffling the bools in a vector<bool>. It is cache-line friendly and involves no swapping. It's presented in executable but simple algorithmic form, as requested by the OP. Much can be done to optimize this, such as improving the speed of bit counting and clearing the array.
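For instance, the same selection idea can be applied directly to the OP's `uint64_t` interface; a sketch assuming GCC/Clang for `__builtin_popcountll` (C++20 code could use `std::popcount` instead):

```cpp
#include <cstdint>
#include <random>

// Selection sampling over raw 64-bit words (sketch): count and clear the set
// bits, then set each position with probability ones / positions_remaining,
// which reproduces the distribution of a uniform bit shuffle without swaps.
template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g)
{
    const std::uint64_t total_bits = 64 * static_cast<std::uint64_t>(last - first);
    std::uint64_t ones = 0;
    for (std::uint64_t* p = first; p != last; ++p) {
        ones += __builtin_popcountll(*p);  // count set bits...
        *p = 0;                            // ...and clear as we go
    }
    for (std::uint64_t pos = 0; ones != 0; ++pos) {
        std::uniform_int_distribution<std::uint64_t> dis(0, total_bits - pos - 1);
        if (dis(g) < ones) {               // place one of the remaining 1 bits here
            first[pos / 64] |= std::uint64_t{1} << (pos % 64);
            --ones;
        }
    }
}
```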

This sets 4 bits out of 10, calls the "shuffle" routine 100,000 times, and prints the number of times a 1 bit occurs in each of the 10 locations. It should be around 40,000 in each position.

int main()
{
    std::vector<bool> initial{ 1,1,1,1,0,0,0,0,0,0 };
    std::vector<int> totals(initial.size());
    for (int i = 0; i < 100000; i++)
    {
        auto a_distribution = DistributeBitsRandomly(initial);
        for (int ii = 0; ii < totals.size(); ii++)
            if (a_distribution[ii])
                totals[ii]++;
    }
    for (auto cnt : totals)
        std::cout << cnt << "\n";
}

Possible Output:

40116
39854
40045
39917
40105
40074
40214
39963
39946
39766
doug