How to get random and unique values from a vector?

Question

Possible Duplicate:
Unique random numbers in O(1)?
Unique random numbers in an integer array in the C programming language

I have a std::vector of unique elements of some undetermined size. I want to fetch 20 unique and random elements from this vector. By 'unique' I mean that I do not want to fetch the same index more than once. Currently the way I do this is to call std::random_shuffle. But this requires me to shuffle the entire vector (which may contain over 1000 elements). I don't mind mutating the vector (I prefer not to though, as I won't need to use thread locks), but most important is that I want this to be efficient. I shouldn't be shuffling more than I need to.

Note that I've looked into passing in a partial range to std::random_shuffle but it will only ever shuffle that subset of elements, which would mean that the elements outside of that range never get used!

Help is appreciated. Thank you!

Note: I'm using Visual Studio 2005, so I do not have access to C++11 features and libraries.

already asked a lot of times, search for 'unique random numbers'... http://stackoverflow.com/questions/1608181/unique-random-numbers-in-an-integer-array-in-the-c-programming-language http://stackoverflow.com/questions/2502386/how-can-i-efficiently-select-several-unique-random-numbers-from-1-to-50-excludi — dreamzor, Dec 07 '12 at 23:14

Pubby · Answer 1 · 2012-12-07T23:18:39.677

You can use Fisher Yates http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle

The Fisher–Yates shuffle (named after Ronald Fisher and Frank Yates), also known as the Knuth shuffle (after Donald Knuth), is an algorithm for generating a random permutation of a finite set—in plain terms, for randomly shuffling the set. A variant of the Fisher–Yates shuffle, known as Sattolo's algorithm, may be used to generate random cycles of length n instead. Properly implemented, the Fisher–Yates shuffle is unbiased, so that every permutation is equally likely. The modern version of the algorithm is also rather efficient, requiring only time proportional to the number of items being shuffled and no additional storage space. The basic process of Fisher–Yates shuffling is similar to randomly picking numbered tickets out of a hat, or cards from a deck, one after another until there are no more left. What the specific algorithm provides is a way of doing this numerically in an efficient and rigorous manner that, properly done, guarantees an unbiased result.

I think this pseudocode should work (there is a chance of an off-by-one mistake or something so double check it!):

std::list chosen; // you don't have to use this since the chosen ones will be in the back of the vector
for(int i = 0; i < num; ++i) {
  int index = rand_between(0, vec.size() - i - 1);
  chosen.push_back(vec[index]);
  swap(vec[index], vec[vec.size() - i - 1]);
}

@AndrewTomazos-Fathomling What? It's only shuffling `num` times where `num` is the number of random uniques you want. — Pubby, Dec 07 '12 at 23:21
But on every iteration you don't chose an element with equal probability, right? Does it have a uniform distribution, then? — dreamzor, Dec 07 '12 at 23:22
The concept is sound but I'm not sure the pseudo-code is correct. — Mark Ransom, Dec 07 '12 at 23:22

Andrew Tomazos · Answer 2 · 2012-12-08T00:46:00.787

8

You want a random sample of size m from an n-vector:

Let rand(a) return 0..a-1 uniform

for (int i = 0; i < m; i++)
    swap(X[i],X[i+rand(n-i)]);

X[0..m-1] is now a random sample.

edited Dec 08 '12 at 00:46

answered Dec 07 '12 at 23:15

Andrew Tomazos

66,139
40
186
319

Mark Ransom · Accepted Answer · 2012-12-08T23:32:38.797

3

Use a loop to put random index numbers into a std::set and stop when the size() reaches 20.

std::set<int> indexes;
std::vector<my_vector::value_type> choices;
int max_index = my_vector.size();
while (indexes.size() < min(20, max_index))
{
    int random_index = rand() % max_index;
    if (indexes.find(random_index) == indexes.end())
    {
        choices.push_back(my_vector[random_index]);
        indexes.insert(random_index);
    }
}

The random number generation is the first thing that popped into my head, feel free to use something better.

edited Dec 08 '12 at 23:32

answered Dec 07 '12 at 23:11

Mark Ransom

299,747
42
398
622

Are you sure it well be a true uniform distribution? ;) – dreamzor Dec 07 '12 at 23:12
4

That would be a potentially infinite loop. – Joseph Mansfield Dec 07 '12 at 23:12
@sftrabbit I figured that out as I was filling in example code. I hope I've fixed it to your satisfaction. – Mark Ransom Dec 07 '12 at 23:16
@sftrabbit: only when the value array is too small, in which case the OP's desired selection isn't possible anyway. – leftaroundabout Dec 07 '12 at 23:16
@Mark Random, leftaroundabout I think that he means that, in general, you can not guarantee that such an algorithm finishes within any given time (thus a potentially infinite loop), unlike the solutions in the other answers – catchmeifyoutry Dec 07 '12 at 23:22
Guys I don't need anything "perfect" (i.e. uniform distribution). It can be a naive implementation, I don't care. Just as long as I can get a somewhat varying result each time I use it. – void.pointer Dec 07 '12 at 23:29
@catchmeifyoutry, if it doesn't finish then you need a better random number generator. This one won't do: http://dilbert.com/strips/comic/2001-10-25/ – Mark Ransom Dec 07 '12 at 23:29
I'm accepting this because: 1) It follows the KISS principle 2) It's more efficient than shuffling the whole vector 3) It doesn't require me to mutate the vector. Thanks!!! – void.pointer Dec 07 '12 at 23:36
@AndrewTomazos-Fathomling, `std::set` doesn't allow duplicates - it ignores them. That's why this works. – Mark Ransom Dec 08 '12 at 02:45
@AndrewTomazos-Fathomling, I'm not sure how you're calculating `O`. Inserting into the set is `O(log m)` which you will do `m` times (plus a few collisions, but they shouldn't affect the total too much). Iterating the set and doing a lookup on each index will be `O(m)` in total. – Mark Ransom Dec 08 '12 at 03:27
Although it likely doesn't matter for your purposes, all the dynamic allocation and pointer chasing involved in `std::set` would probably be slow in this usage. _[Why you shouldn't use set (and what you should use instead)](http://www.lafstern.org/matt/col1.pdf)_ – bames53 Dec 08 '12 at 20:41
@MarkRansom: The probability of a collision for a single iteration is `index.size() / my_vector.size()`. Every collision causes another iteration which can also collide. Calculate the expected number of iterations as a function of m and n. – Andrew Tomazos Dec 08 '12 at 23:06
@RobertDailey, I just realized a fatal flaw in this algorithm: the set doesn't store its results in the order you insert them, it keeps them sorted. The output won't be very random. Let me fix it. – Mark Ransom Dec 08 '12 at 23:28
@AndrewTomazos-Fathomling, I did a little simulation with 100000 trials to see how much of a problem this could be. To select 20 out of 20 (worst case) takes on average 72 random numbers, with a range of 24 to 263. Choosing 20 out of 100 is more reasonable, with an average of 22 and a range of 20 to 34. – Mark Ransom Dec 09 '12 at 04:24
@markransom I don't think that is a fatal flaw. the results are random, that's all that matters for me. but I would be interested to see how you solve that issue. – void.pointer Dec 09 '12 at 15:18
actually if you use unordered_set thatwould fix it – void.pointer Dec 09 '12 at 15:19
@MarkRansom: And how long does a sample of 100000 out of 100000 take? – Andrew Tomazos Dec 09 '12 at 20:14
@RobertDailey, I fixed it already, very shortly after I mentioned it. I built a result vector in parallel with the set. And `unordered_set` wouldn't fix it because it doesn't preserve the order either, and the order isn't random - just unspecified. – Mark Ransom Dec 09 '12 at 21:21

score 0 · Answer 4 · answered Dec 07 '12 at 23:21

#include <iostream>
#include <vector>
#include <algorithm>

template<int N>
struct NIntegers {
  int values[N];
};
template<int N, int Max, typename RandomGenerator>
NIntegers<N> MakeNRandomIntegers( RandomGenerator func ) {
  NIntegers<N> result;
  for(int i = 0; i < N; ++i)
  {
    result.values[i] = func( Max-i );
  }
  std::sort(&result.values[0], &result.values[0]+N);
  for(int i = 0; i < N; ++i)
  {
    result.values[i] += i;
  }
  return result;
};

Use example:

// use a better one:
int BadRandomNumberGenerator(int Max) {
  return Max>4?4:Max/2;
}
int main() {
  NIntegers<100> result = MakeNRandomIntegers<100, 500>( BadRandomNumberGenerator );
  for (int i = 0; i < 100; ++i) {
    std::cout << i << ":" << result.values[i] << "\n";
  }
}

make each number 1 smaller in max than the last. Sort them, then bump up each value by the number of integers before it.

template stuff is just trade dress.

How to get random and unique values from a vector?

4 Answers4