1

I need to permute N numbers between 0 and N-1 in the fastest way (on a CPU, without multi-threading, but maybe with SIMD). N is not large, I think in most cases, N<=12, so N! fits a signed 32-bit integer.

What I have tried so far is roughly the following (some optimizations are omitted, and my original code is in Java, but we speak performance in C++ if not pseudo-code):

#include <random>
#include <cstdint>
#include <iostream>

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}


static uint64_t s[2];

uint64_t Next(void) {
    const uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    const uint64_t result = rotl(s0 + s1, 17) + s0;

    s1 ^= s0;
    s[0] = rotl(s0, 49) ^ s1 ^ (s1 << 21); // a, b
    s[1] = rotl(s1, 28); // c

    return result;
}

// Assume the array |dest| must have enough space for N items
void GenPerm(int* dest, const int N) {
    for(int i=0; i<N; i++) {
        dest[i] = i;
    }
    uint64_t random = Next();
    for(int i=0; i+1<N; i++) {
        const int ring = (N-i);
        // I hope the compiler optimizes acquisition
        // of the quotient and modulo for the same
        // dividend and divisor pair into a single
        // CPU instruction, at least in Java it does
        const int pos = random % ring + i;
        random /= ring;
        const int t = dest[pos];
        dest[pos] = dest[i];
        dest[i] = t;
    }
}

int main() {
    std::random_device rd;
    uint32_t* seed = reinterpret_cast<uint32_t*>(s);
    for(int i=0; i<4; i++) {
        seed[i] = rd();
    }
    int dest[20];
    for(int i=0; i<10; i++) {
        GenPerm(dest, 12);
        for(int j=0; j<12; j++) {
            std::cout << dest[j] << ' ';
        }
        std::cout << std::endl;
    }
    return 0;
}

The above is slow because the CPU's modulo operation (%) is slow. I could think of generating one random number between 0 and N!-1 (inclusive); this will reduce the number of modulo operations and Next() calls, but I don't know how to proceed then. Another approach could be to replace the division operation with multiplication by the inverse integer number at the cost of small bias in the modulos generated, but I don't these inverse integers and multiplication will probably not be much faster (bitwise operations & shifts should be faster).

Any more concrete ideas?

UPDATE: I've been asked why it's a bottleneck in the real application. So I just posted a task that may be of interest to the other folks. The real task in production is:

struct Item {
    uint8_t is_free_; // 0 or 1
    // ... other members ...
};

Item* PickItem(const int time) {
    // hash-map lookup, non-empty arrays
    std::vector<std::vector<Item*>>> &arrays = GetArrays(time);
    Item* busy = nullptr;
    for(int i=0; i<arrays.size(); i++) {
        uint64_t random = Next();
        for(int j=0; j+1<arrays[i].size(); j++) {
            const int ring = (arrays[i].size()-j);
            const int pos = random % ring + j;
            random /= ring;
            Item *cur = arrays[i][pos];
            if(cur.is_free_) {
                // Return a random free item from the first array
                // where there is at least one free item
                return cur;
            }
            arrays[i][pos] = arrays[i][j];
            arrays[i][j] = cur;
        }
        Item* cur = arrays[i][arrays[i].size()-1];
        if(cur.is_free_) {
            return cur;
        } else {
            // Return the busy item in the last array if no free
            // items are found
            busy = cur;
        }
    }
    return busy;
}
Serge Rogatch
  • 13,865
  • 7
  • 86
  • 158
  • Is calling `GenPerm` multiple times supposed to set `dest` to different values? It doesn't in my case. Please provide an [MCVE](https://stackoverflow.com/help/minimal-reproducible-example). – Nelfeal Oct 03 '22 at 18:33
  • @Nelfeal, that was because you didn't initialize the seed. I've expanded the example and checked it in an online C++ compiler. It prints 10 random permutations of 12 numbers. – Serge Rogatch Oct 03 '22 at 18:53
  • I'm curious what you're using these permutations for, that the actual _generating_ of them is the performance bottleneck rather than whatever they're used for. – Thomas Oct 03 '22 at 19:11
  • One way to avoid the modulo is to round the upper bound `ring` to the next power of two, which lets you do the modulo with a bitwise AND operation. Then, if the resulting number is larger than `ring`, just try again. Of course this comes at the expense of branching, so it might not be faster at all, but you could try it and see. – Thomas Oct 03 '22 at 19:14
  • 4
    Have you looked at `std::shuffle`? – doug Oct 03 '22 at 19:38
  • @Thomas I've edited the question to show the real task in production. A single item is selected, and then not many operations are done on it. So the `PickItem()` call subtree takes ~20% of the total runtime in the program. – Serge Rogatch Oct 03 '22 at 19:38
  • @doug I used `std::shuffle` in an application, and the unbiased generation of random numbers turned out to be the bottleneck (it is MSVC's C++ library). – j6t Oct 03 '22 at 19:47
  • 3
    Using `%` is not just slow, it also introduces the potential for [modulo bias](https://stackoverflow.com/q/10984974/2166798). To get unbiased uniformly distributed results pretty much as fast as possible, check out the code in the appendix of "Daniel Lemire. 2019. Fast Random Integer Generation in an Interval. ACM Trans. Model. Comput. Simul. 29, 1, Article 3 (February 2019), 12 pages. DOI:https://doi.org/10.1145/3230636". – pjs Oct 03 '22 at 19:48
  • @j6t Interesting. One can select the random generator but the range selection is put on top of that and that's Microsoft's code. Is that what you found a bottleneck? – doug Oct 03 '22 at 21:38
  • @doug I used `minstd_rand` (which should be faster than e.g. `mt19937`). The slow part is the bit fiddling needed to choose a random element in a range without a bias. – j6t Oct 04 '22 at 07:54

1 Answers1

1

I came up with the following solution in C++ (though not very portable to Java, because Java doesn't allow parametrizing generics with a constant - in Java I had to use polymorphism, as well as a lot of code duplication):

#include <random>
#include <cstdint>
#include <iostream>

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}


static uint64_t s[2];

uint64_t Next(void) {
    const uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    const uint64_t result = rotl(s0 + s1, 17) + s0;

    s1 ^= s0;
    s[0] = rotl(s0, 49) ^ s1 ^ (s1 << 21); // a, b
    s[1] = rotl(s1, 28); // c

    return result;
}

template<int N> void GenPermInner(int* dest, const uint64_t random) {
    // Because N is a constant, the compiler can optimize the division
    // by N with more lightweight operations like shifts and additions
    const int pos = random % N;
    const int t = dest[pos];
    dest[pos] = dest[0];
    dest[0] = t;
    return GenPermInner<N-1>(dest+1, random / N);
}

template<> void GenPermInner<0>(int*, const uint64_t) {
    return;
}

template<> void GenPermInner<1>(int*, const uint64_t) {
    return;
}

// Assume the array |dest| must have enough space for N items
void GenPerm(int* dest, const int N) {
    switch(N) {
    case 0:
    case 1:
        return;
    case 2:
        return GenPermInner<2>(dest, Next());
    case 3:
        return GenPermInner<3>(dest, Next());
    case 4:
        return GenPermInner<4>(dest, Next());
    case 5:
        return GenPermInner<5>(dest, Next());
    case 6:
        return GenPermInner<6>(dest, Next());
    case 7:
        return GenPermInner<7>(dest, Next());
    case 8:
        return GenPermInner<8>(dest, Next());
    case 9:
        return GenPermInner<9>(dest, Next());
    case 10:
        return GenPermInner<10>(dest, Next());
    case 11:
        return GenPermInner<11>(dest, Next());
    case 12:
        return GenPermInner<12>(dest, Next());
    // You can continue with larger numbers, so long as (N!-1) fits 64 bits
    default: {
        const uint64_t random = Next();
        const int pos = random % N;
        const int t = dest[pos];
        dest[pos] = dest[0];
        dest[0] = t;
        return GenPerm(dest+1, N-1);
    }
    }
}

int main() {
    std::random_device rd;
    uint32_t* seed = reinterpret_cast<uint32_t*>(s);
    for(int i=0; i<4; i++) {
        seed[i] = rd();
    }
    int dest[20];
    const int N = 12;
    // No need to init again and again
    for(int j=0; j<N; j++) {
        dest[j] =j;
    }
    for(int i=0; i<10; i++) {
        GenPerm(dest, N);
        // Or, if you know N at compile-time, call directly
        // GenPermInner<N>(dest, Next());
        for(int j=0; j<N; j++) {
            std::cout << dest[j] << ' ';
        }
        std::cout << std::endl;
    }
    return 0;
}
Serge Rogatch
  • 13,865
  • 7
  • 86
  • 158