4

I would like a function that will produce k pseudo-random values from a set of n integers, zero to n-1, without repeating any previous result. k is less than or equal to n. O(n) memory is unacceptable because of the large size of n and the frequency with which I'll need to re-shuffle.

These are the methods I've considered so far:

Array: Normally if I wanted duplicate-free random values I'd shuffle an array, but that's O(n) memory. n is likely to be too large for that to work.

long nextvalue(void) {
  static long array[4000000000];   // the O(n) storage that makes this impractical
  static long s = 0;
  if (s == 0) {
    for (long i = 0; i < 4000000000; i++) array[i] = i;
    shuffle(array, 4000000000);    // assumed to be a fair shuffle
  }
  return array[s++];
}
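
For reference, shuffle() above is assumed to be something like a Fisher-Yates shuffle. A minimal sketch, where rand_index() is a hypothetical stand-in for a uniform random draw from [0, limit):

void shuffle(long *array, long count) {
  // Fisher-Yates: swap each element with a uniformly chosen element at or after it
  for (long i = 0; i < count - 1; i++) {
    long j = i + rand_index(count - i);  // j in [i, count)
    long tmp = array[i];
    array[i] = array[j];
    array[j] = tmp;
  }
}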

n-state PRNG: There are a variety of random number generators that can be designed so as to have a period of n and to visit n unique states over that period. The simplest example would be:

long nextvalue(void) {
  static long s = 0;
  static const long i = 1009; // increment; assumed co-prime to n
  s = (s + i) % n;
  return s;
}

The problem with this is that it's not necessarily easy to design a good PRNG on the fly for a given n, and such a PRNG is unlikely to approximate a fair shuffle unless it has a lot of variable parameters (which are even harder to design). But maybe there's a good one I don't know about.
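
To illustrate the sort of design work involved: a full-period linear congruential generator can be constructed from the Hull-Dobell conditions whenever the prime factorisation of n is known. A sketch with a hard-coded example n; the constants here are illustrative only:

/* Hull-Dobell: s' = (a*s + c) % n visits all n states iff
 *   - c is co-prime to n,
 *   - a - 1 is divisible by every prime factor of n,
 *   - a - 1 is divisible by 4 if n is divisible by 4.
 * Example: n = 1000000 = 2^6 * 5^6, so a - 1 must be a multiple of 20. */
long nextvalue(void) {
  static const long n = 1000000;
  static const long a = 21;  // a - 1 = 20: divisible by 2, 5, and 4
  static const long c = 7;   // co-prime to n
  static long s = 0;
  s = (a * s + c) % n;
  return s;
}

The catch remains the one described above: with so few free parameters, the visiting order is nowhere near a fair shuffle.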

m-bit hash: If the size of the set is a power of two, then it's possible to devise a perfect hash function f() which performs a 1:1 mapping from any value in the range to some other value in the range, where every input produces a unique output. Using this function I could simply maintain a static counter s, and implement a generator as:

long nextvalue(void) {
  static long s = 0;
  return f(s++);
}

This isn't ideal because the order of the results is determined by f(), rather than random values, so it's subject to all the same problems as above.
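
For concreteness, constructing such an f() for a power-of-two range is straightforward, because multiplication by any odd constant is invertible modulo 2^m. A minimal (and deliberately weak) example, with the width M and the multiplier chosen arbitrarily:

#include <stdint.h>

#define M 32                          // the range is 2^M
#define MASK ((UINT64_C(1) << M) - 1)

uint64_t f(uint64_t x) {
  // any odd multiplier is invertible modulo 2^M, so this maps the range 1:1 onto itself
  return (x * UINT64_C(0x9e3779b9)) & MASK;
}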

NPOT hash: In principle I can use the same design principles as above to define a version of f() which works in an arbitrary base (or even a composite one) compatible with the range needed; but that's potentially difficult, and I'm likely to get it wrong. Instead, a function can be defined for the next power of two greater than or equal to n, and used in this construction:

long nextvalue(void) {
  static long s = 0;
  long x = s++;
  do { x = f(x); } while (x >= n);  // re-apply until the result falls in range
  return x;
}

But this still has the same problem (it's unlikely to give a good approximation of a fair shuffle).

Is there a better way to handle this situation? Or perhaps I just need a good function for f() that is highly parameterisable and easy to design to visit exactly n discrete states.

One thing I'm thinking of is a hash-like operation where I contrive to have the first j results perfectly random through a carefully designed mapping, and then any results between j and k would simply extrapolate from that pattern (albeit in a predictable way). The value j could then be chosen as a compromise between a fair shuffle and a tolerable memory footprint.

sh1
  • How about using a cryptographically secure PRNG? How many bits wide are your numbers? You could feed the output of the PRNG to a `sha1` secure hash. What are ballpark values for `n` and `k`? Chances are fair to good that `k` will be below the periodicity of the PRNG. Do you care about time? Are you going to grab the random number, use it, and discard it? Can you accept a probability of uniqueness (vs. absolute)? – Craig Estey Jul 20 '16 at 05:13
  • While I haven't completely thought through all the ways of doing it, it seems like the kicker is that regardless of whether you shuffle an array or generate the values and keep track of previous ones to ensure uniqueness, you essentially require an array of `n`. The size issue can be addressed somewhat if you generate and use an array of `char` to keep track of the previously generated values (storage is 1 byte per value instead of 4, 8, etc.). – David C. Rankin Jul 20 '16 at 05:52
  • If k can equal n, there's no avoiding O(n) memory. – aschepler Jul 20 '16 at 16:51
  • How do you intend to check what you already got/didn't get without using `O(k)` memory, with `k` being potentially as big as `n`? – Mad Physicist Jul 20 '16 at 16:57
  • OK, so as I explained in the question, there are a variety of algorithms that can iterate pseudo-randomly through a list and touch every point without repetition (up to n, but not beyond). The problem with them is that they don't have free choice on the order of the results that they give, and so after seeing one or two results you can trivially predict all subsequent results. _Completely_ free choice seems impossible (though I can't prove it), but a demonstrably strong (hard to predict) randomised choice would be sufficient. – sh1 Jul 20 '16 at 17:08
  • The theory behind [format-preserving encryption](https://en.wikipedia.org/wiki/Format-preserving_encryption) looks promising. Not that it's theoretically better than a transparent hash for quality distribution, but because ciphers can (must) be unpredictable and use keys so large that they inherently choose from a very large variety of orders. – sh1 Jul 20 '16 at 18:46
  • @CraigEstey, `n` will be about four billion (2**32, or a nearby value), fast-out tests may restrict `k` to being substantially smaller normally, but it should reach as high as `n` at least sometimes. When `k` reaches `n` I really do need to know that I've covered every possible value. After that many operations time can't help but matter. – sh1 Jul 21 '16 at 18:44
  • (1) Do you need to retain a list of the exact values? This would be O(n) integers or 17 GB. Or, (2) you want to generate the values one-at-a-time, guaranteeing uniqueness, and discard? I would assume (1) is _no_ [based on your **bold** text :-)]. So, if (2), would 500 MB [or _less_] be acceptable? If so, I have a solution. Can you elaborate slightly on any additional constraints? – Craig Estey Jul 21 '16 at 19:04
  • @CraigEstey, I'll use them and then throw them away, one at a time. I just need to eventually visit every value in a randomised order (unless there's a fast-out condition that stops before the end). 500MB is too large, I think, unless I don't need to access the whole thing when a fast-out case stops k at a smaller value. Unfortunately I don't have a distribution for those cases yet. – sh1 Jul 22 '16 at 04:19
  • I suspect the answer might be [here](http://statweb.stanford.edu/~cgates/PERSI/papers/subgroup-rand-var.pdf), but I don't understand any of it. However, I found that while searching for properties of random binary matrices, which would offer a reasonable parameter space for bijective transforms if I could select a good subset uniformly. Things are looking positive! – sh1 Sep 14 '16 at 22:50

2 Answers

2

First of all, it seems unreasonable to discount anything that uses O(n) memory and then discuss a solution that refers to an underlying array. You have an array. Shuffle it. If that doesn't work or isn't fast enough, come back to us with a question about it.

You only need to perform a complete shuffle once. After that, whenever you want to draw, take the element at an index cursor, swap it with a randomly chosen element earlier in the array, and advance the cursor modulo the element count. With such a large dataset I'd use something like the sketch below.
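
A sketch of my reading of that scheme (not necessarily the exact code intended; rand_below() is a hypothetical uniform generator on [0, limit), and swapping "at or before" the cursor sidesteps the empty-prefix case on the very first draw):

long draw(long *deck, long count) {
  static long cursor = 0;
  long value = deck[cursor];
  long j = rand_below(cursor + 1);  // random index at or before the cursor
  deck[cursor] = deck[j];           // an already-drawn value takes this slot...
  deck[j] = value;                  // ...and the drawn value is tucked back behind the cursor
  cursor = (cursor + 1) % count;    // the next draw comes from the next slot
  return value;
}

Within one pass no value repeats, because the swaps only touch positions at or before the cursor; by the time the cursor wraps, the deck has been re-permuted for the next pass.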

Prime numbers are an option for hashes, but probably not in the way you're thinking. Using two Mersenne primes (a low one and a high one, say 0x1fff and 0x7fffffff) you should be able to come up with a much more general-purpose hashing algorithm.

#include <stddef.h>

size_t hash(unsigned char *value, size_t value_size, size_t low, size_t high) {
    size_t x = 0;
    while (value_size--) {
        x += *value++;  // mix in each octet...
        x *= low;       // ...and spread it with the low prime
    }
    return x % high;    // fold the result into [0, high)
}
/* convenience wrapper so any object pointer can be passed without a cast */
#define hash(value, value_size, low, high) (hash((void *) value, value_size, low, high))

This should produce something fairly well distributed for all inputs larger than about two octets, with the minor troublesome exception of zero-byte prefixes; you might want to treat those differently.
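
For example, usage might look like this (a sketch; the primes are just the ones suggested above):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *key = "example";
    size_t h = hash(key, strlen(key), 0x1fff, 0x7fffffff);
    printf("%zu\n", h);
    return 0;
}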

autistic
  • Shuffling the array probably won't be fast enough because it's around 16GB in size. – sh1 Jul 21 '16 at 18:48
  • @sh1 Really?! You only need to shuffle once; after that, swap the items you drew randomly back into the array elsewhere and you have another randomly shuffled array. – autistic Jul 22 '16 at 06:59
  • Also, with such a large size you're obviously not using an array with `static` storage duration; that'd be lunacy. [Shuffle the file as you read entries from it](https://gist.github.com/Sebbyastian/40be860d95f9df95f319a2f17dc74dde). – autistic Jul 22 '16 at 07:08
  • I actually need only the indices themselves. There is no storage of an object referenced by the index. Instead I calculate information directly from the numerical value. So if I wanted to solve it by sorting a file, I would have to create that file in advance. – sh1 Jul 29 '16 at 04:26
  • So you really have a 16GB allocation storing all of that? Sort it. Shuffle as you draw, as though the cards go back to the beginning of the deck and the card you select is at an increasing index... This way it stays shuffled and you can just start again at zero when you pick the last item. Why isn't a database an acceptable solution? – autistic Jul 29 '16 at 07:36
0

So... what I've ended up doing is digging deeper into pre-existing methods to try to confirm their ability to approximate a fair shuffle.

I take a simple counter, which is itself guaranteed to visit every in-range value exactly once, and 'encrypt' it with an n-bit block cypher. More precisely, I round the range up to a power of two and apply some 1:1 function; then, if the result is out of range, I repeat the permutation until the result falls in range (the 'cycle walking' construction used in format-preserving encryption).

This is guaranteed to complete eventually because there are only a finite number of out-of-range values within the power-of-two range, and they cannot form an always-out-of-range cycle: that would imply that some value in the cycle was mapped to from two different previous states (one from the in-range set and another from the out-of-range set), which would make the function not bijective.

So all I need to do is devise a parameterisable function which I can tune to an arbitrary number of bits. Like this one:

#include <stdint.h>

/* BITS and BITMASK (the mask of the low BITS bits) are assumed to be defined elsewhere. */
uint64_t mix(uint64_t x, uint64_t k) {
  const int s0 = BITS * 4 / 5;
  const int s1 = BITS / 5 + (k & 1);
  const int s2 = BITS * 2 / 5;
  k |= 1;                     // force the multiplier odd, so it's invertible mod 2^BITS

  x *= k;                     // multiply by the (odd) key
  x ^= (x & BITMASK) >> s0;   // xorshift right
  x ^= (x << s1) & BITMASK;   // xorshift left
  x ^= (x & BITMASK) >> s2;   // xorshift right again
  x += 0x9e3779b97f4a7c15;    // add a constant (the 64-bit golden ratio)

  return x & BITMASK;
}

I know it's bijective because I happen to have its inverse function handy:

uint64_t unmix(uint64_t x, uint64_t k) {
  const int s0 = BITS * 4 / 5;
  const int s1 = BITS / 5 + (k & 1);
  const int s2 = BITS * 2 / 5;
  k |= 1;
  uint64_t kp = k * k;
  while ((kp & BITMASK) > 1) {  // replace k with its multiplicative inverse mod 2^BITS
    k *= kp;
    kp *= kp;
  }

  x -= 0x9e3779b97f4a7c15;      // undo the constant addition
  x ^= ((x & BITMASK) >> s2) ^ ((x & BITMASK) >> s2 * 2);  // undo the second right xorshift
  x ^= (x << s1) ^ (x << s1 * 2) ^ (x << s1 * 3) ^ (x << s1 * 4) ^ (x << s1 * 5);  // undo the left xorshift
  x ^= (x & BITMASK) >> s0;     // undo the first right xorshift
  x *= k;                       // multiply by the inverse key

  return x & BITMASK;
}
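
A quick way to check that claim empirically, under the same BITS/BITMASK assumptions, is to assert the round trip over a sample of values and keys:

#include <assert.h>

void check_roundtrip(void) {
  uint64_t k = 0x123456789abcdef1;  // arbitrary test key
  for (uint64_t x = 0; x < 1000000; x++) {
    assert(unmix(mix(x & BITMASK, k), k) == (x & BITMASK));
  }
}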

This allows me to define a simple parameterisable PRNG like this:

/* ROUNDS and RANGE are assumed to be defined elsewhere; RANGE is the n of the problem. */
uint64_t key[ROUNDS];
uint64_t seed = 0;
uint64_t rand_no_rep(void) {
  uint64_t x = seed++;   // a plain counter visits every value exactly once
  do {
    for (int i = 0; i < ROUNDS; i++) x = mix(x, key[i]);
  } while (x >= RANGE);  // cycle-walk until the result lands in range
  return x;
}

Initialise seed and key to random values and you're good to go.
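
For example, initialisation might look like this (a sketch; rand() is only a placeholder for whatever entropy source you prefer, and seed is left at zero here because that's the simplest choice that keeps the counter's no-repeat property):

#include <stdlib.h>

void rand_no_rep_init(unsigned s) {
  srand(s);
  seed = 0;
  for (int i = 0; i < ROUNDS; i++) {
    // splice two rand() calls into a 64-bit key; a real seeding scheme would do better
    key[i] = ((uint64_t) rand() << 32) ^ (uint64_t) rand();
  }
}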

Using the inverse function lets me determine what seed must be to force rand_no_rep() to produce a given output, which makes it much easier to test.
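
Concretely, the counter value that produces a given output can be recovered by running the rounds backwards and cycle-walking with the inverse (a sketch under the same ROUNDS/RANGE/key[] assumptions):

uint64_t seed_for(uint64_t target) {
  uint64_t x = target;
  do {  // walk the inverse permutation until we land on an in-range counter value
    for (int i = ROUNDS - 1; i >= 0; i--) x = unmix(x, key[i]);
  } while (x >= RANGE);
  return x;  // the value seed must hold for rand_no_rep() to return target
}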

So far I've checked the cases where a constant a is followed by a constant b. For ROUNDS==1, pairs collide on exactly 50% of the keys (and each pair of collisions is with a different pair of a and b; they don't all converge on 0, 1 or whatever). That is, for various k, a specific a-followed-by-b case occurs for more than one k (this must happen at least once). Subsequent values do not collide in that case, so different keys aren't falling into the same cycle at different positions; every k gives a unique cycle.

The 50% collision figure comes from 25% of entries not being unique when they're added to the list (counting both the new entry and the one it ran into). That might sound bad, but it's actually lower than birthday-paradox logic would suggest: selecting randomly, the percentage of new entries that fail to be unique looks to converge to between 36% and 37%. Being "better than random" is obviously worse than random, as far as randomness goes, but that's why they're called pseudo-random numbers.

Extending that to ROUNDS==2, we want to make sure that a second round doesn't cancel out or simply repeat the effects of the first.

This is important because it would mean that multiple rounds are a waste of time and memory, and that the function cannot be parameterised to any substantial degree. It could happen trivially if mix() contained only linear operations (say, multiply and add, mod RANGE); in that case all of the parameters could be combined into a single parameter for a single round with the same effect (for example, a2*(a1*x + b1) + b2 = (a1*a2)*x + (a2*b1 + b2), mod RANGE). That would be disappointing, as it would reduce the number of attainable permutations to the size of just that one parameter, and if the set is as small as that then more work would be needed to ensure that it's a good, representative set.

So what we want to see from two rounds is a large set of outcomes that could never be achieved by one round. One way to demonstrate this is to look for the original b-follows-a cases with an additional parameter c, where we want to see every possible c following a and b.

We know from the one-round testing that in 50% of cases there is only one c that can follow a and b because there is only one k that places b immediately after a. We also know that 25% of the pairs of a and b were unreachable (being the gap left behind by half the pairs that went into collisions rather than new unique values), and the last 25% appear for two different k.

The result that I get is that, given a free choice of both keys, it's possible to find about five eighths of the values of c following a given a and b. About a quarter of the a/b pairs are unreachable (it's less predictable now, because of the potential intermediate mappings into or out of the duplicate or unreachable cases) and a quarter have a, b, and c appear together in two sequences (which diverge afterwards).

I think there's a lot to be inferred from the difference between one round and two, but I could be wrong about that and I need to double-check. Further testing gets harder; or at least slower unless I think more carefully about how I'm going to do it.

I haven't yet demonstrated that, amongst the set of permutations it can produce, they're all equally likely; but this is normally not guaranteed for any other PRNG either.

It's fairly slow for a PRNG, but it would fit SIMD trivially.

sh1
  • Which system are you running this on? You're using `int` in a way that potentially invokes UB, so for anyone to find this useful they'd need more info, e.g. what is `sizeof (int)`? – autistic Jul 30 '16 at 11:33
  • Also, what do you mean by "selecting randomly"? You're trying to develop a PRNG that has even distribution, right? Well if you're using such a PRNG for "selecting randomly" then you should see much less than 36-37% collisions. – autistic Jul 30 '16 at 11:44
  • @Seb, good spotting, fixed. The size of the data type doesn't matter (ie., gives the same result) so long as it's at least as large as `BITS`, and unsigned so that it can overflow without UB. – sh1 Jul 30 '16 at 18:17
  • The PRNG has even distribution over its period, but it's incapable of producing the same result twice within its period. The collisions come when varying the `k` parameter and asking the question "if I vary `k`, do I get a different value following `a` for every different `k`?". Because the pairs are members of different sequences they're allowed to collide, (and when selecting from _all_ possible sequences at random they would frequently collide); but if they were to collide too much it would be a symptom that different `k` were failing to produce significantly different sequences. – sh1 Jul 30 '16 at 18:19
  • Is that an implication that when `k` is equal, not all values are produced? I'm not mathematically gifted, but I'd love to see a histogram of the first two cycles for a few poorly chosen values of `k`... I'm not sure whether or not that'd be useful for this answer, but people often look at the worst case when choosing an algorithm. – autistic Aug 21 '16 at 01:13
  • Sorry that this is [more than a little] late. I had been working on a solution to _your_ problem, using a _linear congruential generator_ but got bogged down implementing part of it. Your solution is similar. I was able to post a usable portion of my code [that I would have posted here] to answer another question here: http://stackoverflow.com/questions/39215891/generating-very-large-non-repeating-integer-sequence-without-pre-shuffling/39216211#39216211 Hopefully, you'll still be able to get some benefit, but it sounds like you've already got covered. It's 1 mul and 1 add per number, so fast – Craig Estey Aug 30 '16 at 05:45