
This question is related to this one, and more precisely to this answer to it.

Here goes: I have a C++/TR1 unordered_set U of unsigned ints (rough cardinality 100-50000, rough value range 0 to 10^6). Given a cardinality N, I want to iterate, as quickly as possible, over N random but unique members of U. There is no typical value for N, but it should work fast for small N.

In more detail, the notion of "randomness" here is that two calls should produce somewhat different subsets -- the more different, the better, but this is not too crucial. I would, e.g., be happy with a contiguous (or wrapped-around contiguous) block of N members of U, as long as the start index of the block is random. A non-contiguous subset at the same cost would be better, but the main concern is speed. U changes mildly but constantly between calls (ca. 0-10 elements inserted/erased between calls).

How far I've come:

  1. Trivial approach:
    Pick a random index i such that (i+N-1) < |U|. Get an iterator it to U.begin(), advance it i times using it++, and then run the actual loop over the subset. Advantage: easy. Disadvantage: wasted ++'es.

  2. The bucket approach (and this I've "newly" derived from above link):
    Pick i as above, find the bucket b that the i-th element is in, get a local_iterator lit to U.begin(b), advance lit via lit++ until we reach the i-th element of U, and from then on keep incrementing lit N times. If we hit the end of a bucket, we continue with lit from the beginning of the next bucket. If I want to make it more random, I can pick i completely at random and wrap around the buckets. (A rough sketch of both approaches in code follows below.)
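
For concreteness, here is roughly what I mean, as a sketch rather than my actual code (std::unordered_set stands in for the TR1 container, and the sample_trivial/sample_buckets names, the std::mt19937 RNG, and returning the sample in a vector are purely illustrative):

    #include <cstddef>
    #include <iterator>
    #include <random>
    #include <unordered_set>
    #include <vector>

    typedef std::unordered_set<unsigned int> Set;  // std::tr1::unordered_set under TR1

    // Approach 1 (trivial): advance U.begin() by a random offset i,
    // then copy the next N elements. Costs i + N iterator increments.
    std::vector<unsigned int> sample_trivial(const Set& U, std::size_t N, std::mt19937& rng)
    {
        std::vector<unsigned int> out;
        if (U.size() < N) return out;
        std::uniform_int_distribution<std::size_t> pick(0, U.size() - N);
        Set::const_iterator it = U.begin();
        std::advance(it, pick(rng));               // the wasted ++'es
        for (std::size_t n = 0; n < N; ++n, ++it)
            out.push_back(*it);
        return out;
    }

    // Approach 2 (bucket walk): start in a random bucket and walk local
    // iterators, wrapping around at the last bucket, until N elements are
    // taken. (Starting at a random bucket rather than at the i-th element
    // is a simplification of what I describe above.)
    std::vector<unsigned int> sample_buckets(const Set& U, std::size_t N, std::mt19937& rng)
    {
        std::vector<unsigned int> out;
        if (U.size() < N) return out;
        std::size_t nbuckets = U.bucket_count();
        std::uniform_int_distribution<std::size_t> pick(0, nbuckets - 1);
        std::size_t b = pick(rng);
        for (std::size_t scanned = 0; scanned < nbuckets && out.size() < N; ++scanned) {
            for (Set::const_local_iterator lit = U.begin(b), lend = U.end(b);
                 lit != lend && out.size() < N; ++lit)
                out.push_back(*lit);
            b = (b + 1) % nbuckets;                // continue in the next bucket
        }
        return out;
    }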

My open questions:

  1. For point 2 above, is it really the case that I cannot somehow get an iterator into U once I've found the i-th element? This would spare me the bucket boundary control, etc. For me as quite a beginner, it seems inconceivable that the standard forward iterator should know how to continue traversing U from the i-th item, yet when I find the i-th item myself, there should be no way to traverse U other than through point 2 above.
  2. What else can I do? Do you know anything even much smarter/more random? If possible, I don't want to get involved in manual control of bucket sizes, hash functions, and the like, as this is a bit over my head.
daspostloch
  • The two linear time algorithms you describe seem to be for picking random elements of the set, not for picking random subsets? Also, why don't you just copy the possible iterator values to an array? Then you can pick a random element in constant time. – Cheers and hth. - Alf Dec 15 '10 at 21:57
  • Well, I consider picking N random elements the same as picking a subset of size N. If there is a solution that depends on it being either one, I'd still like to hear it. Could you kindly clarify on the copying, i.e., copy when at what cost, dealing how with the unordered set changing? – daspostloch Dec 15 '10 at 21:59
  • If you were to write your own hashset (or otherwise get at the underlying buckets) you could probably improve performance significantly, if you don't care about having a _truly_ uniform distribution... – bdonlan Aug 31 '11 at 01:57

1 Answer


Depending on what runtime guarantees you want, there's a famous O(n) algorithm (reservoir sampling) for picking k random elements out of a stream of numbers in one pass. To understand the algorithm, let's see it first for the case where we want to pick just one element out of the set, and then we'll generalize it to work for picking k elements. The advantage of this approach is that it doesn't require any advance knowledge of the size of the input set, and it guarantees provably uniform sampling of elements, which is always pretty nice.

Suppose that we want to pick one element out of the set. To do this, we'll make a pass over all of the elements in the set and at each point will maintain a candidate element that we're planning on returning. As we iterate across the list of elements, we'll update our guess with some probability until at the very end we've chosen a single element with uniform probability. At each point, we will maintain the following invariant:

After seeing k elements, the probability that any of the first k elements is currently chosen as the candidate element is 1 / k.

If we maintain this invariant across the entire array, then after seeing all n elements, each of them has a 1 / n chance of being the candidate element. Thus the candidate element has been sampled with uniformly random probability.

To see how the algorithm works, let's think about what it has to do to maintain the invariant. Suppose that we've just seen the very first element. To maintain the above invariant, we have to choose it with probability 1, so we'll set our initial guess of the candidate element to be the first element.

Now, when we come to the second element, we need to hold the invariant that each element is chosen with probability 1/2, since we've seen two elements. So let's suppose that with probability 1/2 we choose the second element. Then we know the following:

  • The probability that we've picked the second element is 1/2.
  • The probability that we've picked the first element is the probability that we chose it the first time around (1) times the probability that we didn't just pick the second element (1/2). This comes out to 1/2 as well.

So at this point the invariant is still maintained! Let's see what happens when we come to the third element. At this point, we need to ensure that each element is picked with probability 1/3. Well, suppose that with probability 1/3 we choose the third element. Then we know that

  • The probability that we've picked the third element is 1/3.
  • The probability that we've picked either of the first two elements is the probability that it was chosen after the first two steps (1/2) times the probability that we didn't choose the third element (2/3). This works out to 1/3.

So again, the invariant holds!

The general pattern here looks like this: after we've seen k elements, each of them has a 1/k chance of being picked. When we see the (k + 1)st element, we choose it with probability 1 / (k + 1). Each of the elements before it is then chosen with probability equal to the chance that it was the candidate before (1 / k) times the chance that we didn't pick the (k + 1)st element this time (k / (k + 1)), which again works out to 1 / (k + 1). Since this maintains the invariant at each step, we've got ourselves a great algorithm:

  1. Choose the first element as the candidate when you see it.
  2. For each successive element, replace the candidate element with that element with probability 1 / k, where k is the number of elements seen so far.

This runs in O(n) time, requires O(1) space, and gives back a uniformly-random element out of the data stream.
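
A minimal sketch of this one-element version, assuming a non-empty std::unordered_set<unsigned int> and a seeded std::mt19937 (the container, RNG, and function name are just illustrative, not the implementation linked further down):

    #include <cstddef>
    #include <random>
    #include <unordered_set>

    // One-pass selection of a single uniformly random element.
    unsigned int pick_one(const std::unordered_set<unsigned int>& U, std::mt19937& rng)
    {
        unsigned int candidate = 0;                // assumes U is non-empty
        std::size_t seen = 0;
        for (std::unordered_set<unsigned int>::const_iterator it = U.begin();
             it != U.end(); ++it) {
            ++seen;
            // Replace the current candidate with probability 1 / seen.
            std::uniform_int_distribution<std::size_t> dist(1, seen);
            if (dist(rng) == 1)
                candidate = *it;
        }
        return candidate;
    }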

Now, let's see how to scale this up to work if we want to pick k elements out of the set, not just one. The idea is extremely similar to the previous algorithm (which actually ends up being a special case of the more general one). Instead of maintaining one candidate, we maintain k different candidates, stored in an array that we number 1, 2, ..., k. At each point, we maintain this invariant:

After seeing m ≥ k elements, the probability that any of the first m elements is currently one of the chosen candidates is k / m.

If we scan across the entire set, this means that when we're done, each element has probability k / n of being chosen. Since we're picking k distinct elements, we therefore sample the elements of the set uniformly at random.

The algorithm is similar to before. First, choose the first k elements out of the set with probability 1. This means that when we've seen k elements, the probability that any of them have been picked is 1 = k / k and the invariant holds. Now, assume inductively that the invariant holds after m iterations and consider the (m + 1)st iteration. Choose a random number between 1 and (m + 1), inclusive. If we choose a number between 1 and k (inclusive), replace that candidate element with the next element. Otherwise, do not choose the next element. This means that we pick the next element with probability k / (m + 1) as required. The probability that any of the first m elements are chosen is then the probability that they were chosen before (k / m) times the probability that we didn't choose the slot containing that element (m / (m + 1)), which gives a total probability of being chosen of k / (m + 1) as required. By induction, this proves that the algorithm perfectly uniformly and randomly samples k elements out of the set!

Moreover, the runtime is O(n), proportional to the size of the set and completely independent of the number of elements you want to choose. It also uses only O(k) memory and makes no assumptions whatsoever about the type of the elements being stored.
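
Here is a rough sketch of this k-element version (again, the names and the std::mt19937 RNG are illustrative assumptions, and this is not the implementation linked below):

    #include <cstddef>
    #include <random>
    #include <unordered_set>
    #include <vector>

    // One-pass reservoir sampling of k elements from the set.
    std::vector<unsigned int> sample_k(const std::unordered_set<unsigned int>& U,
                                       std::size_t k, std::mt19937& rng)
    {
        std::vector<unsigned int> reservoir;
        reservoir.reserve(k);
        std::size_t seen = 0;
        for (std::unordered_set<unsigned int>::const_iterator it = U.begin();
             it != U.end(); ++it) {
            ++seen;
            if (reservoir.size() < k) {
                reservoir.push_back(*it);          // keep the first k unconditionally
            } else {
                // Pick a slot in [0, seen); replace only if it falls in
                // [0, k), which happens with probability k / seen.
                std::uniform_int_distribution<std::size_t> dist(0, seen - 1);
                std::size_t slot = dist(rng);
                if (slot < k)
                    reservoir[slot] = *it;
            }
        }
        return reservoir;                          // fewer than k elements if |U| < k
    }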

Since you're trying to do this for C++, as a shameless self-promotion, I have an implementation of this algorithm (written as an STL algorithm) available here on my personal website. Feel free to use it!

Hope this helps!

templatetypedef
  • This looks like a good solution, but how about describing the algorithm in the answer. – SingleNegationElimination Dec 18 '10 at 21:03
  • Thanks, Keith, this is quite instructive. And I think the link to the description is quite sufficient. As to my actual problem, I took the time to try out both of my approaches above within the actual setting, and much to my surprise the naive one seems to perform on par at least (however not a statistically bullet-proof examination). Apart from this, I started musing whether sub-linear is at all possible for uniform, and I believe not. However, if I know n and it's large enough, I might skip elements from the stream after the first k elements in your solution. So definitely instructive-thanx! – daspostloch Dec 19 '10 at 20:26