1

I have a set of 455 items from which I select at most 160 items randomly, with replacement. First I seed with srand() and then use rand() to select each number. I observe that in my selections of up to 160 items, I tend to see at least 10 items selected more than once. This seems to indicate that the random numbers are not evenly distributed.

Is there a way to have more evenly distributed random numbers?

caf
  • 233,326
  • 40
  • 323
  • 462
Lipika Deka
  • 3,774
  • 6
  • 43
  • 56
  • You (and your "senior", whatever that means) are confusing *random* with *unique*. You can get 4 all day long and the generator can still be perfectly random. – Blindy Nov 28 '11 at 05:38
  • @Blindy I think it's safe to assume that his first language is not English. You don't have to be mean about it. – john Nov 28 '11 at 05:40
  • @Blindy That's not entirely correct. Pseudo-random generators **may** be inefficient/bad, especially when combined with an inexperienced coder. I understand the point you're trying to make, but it is not answering the specific question asked. – littleadv Nov 28 '11 at 05:40
  • There's a big difference between **unique** and **random.** – Naftuli Kay Nov 28 '11 at 06:21
  • @awfullyjohn: Your generalisation edit removed key information - the specific numbers involved here are actually quite important to understanding the probabilities involved. – caf Nov 28 '11 at 06:51
  • @Juggler: are you describing the output from `rand` as is, or after a modulo? – Fred Foo Nov 28 '11 at 16:09
  • @larsmans after a modulo – Lipika Deka Nov 28 '11 at 17:03
  • @caf Thanks for the edit. Captured my question very precisely. – Lipika Deka Nov 28 '11 at 17:10
  • 1
    @juggler: the modulo destroys uniform probabilities. http://www.azillionmonkeys.com/qed/random.html – Fred Foo Nov 28 '11 at 17:46

7 Answers7

5

Your intuition about the results is wrong. If the numbers are truly random and distributed evenly between 0 and 455, then the probability of there being at least 10 duplicates in the set of 160 numbers is actually quite high (in fact it is a virtual certainty). Informally this is called the "Birthday Paradox", although it is not actually a paradox.

This chart shows the probability of different numbers of duplicates appearing when you select 160 indepedent identically distributed values with replacement from a set of 455. As you can see it is actually most likely that you get 22 duplicated values, with there being almost no chance that you get less than 10 or more than 35.

enter image description here

caf
  • 233,326
  • 40
  • 323
  • 462
  • 1
    AFAIK, you should start expecting to see repeats in half your samples at a sample size of ~25 and you should be seeing repeats in the twenties with a sample size of 160. What's the polite way of telling your senior that he needs a probability lesson? – mu is too short Nov 28 '11 at 06:07
3

It sounds like your underlying implementation of generating random numbers is working perfectly fine. What seems to be the problem is the way you are selecting items from among the total population. This wikipedia article on simple random samples describes the difference between choosing a subset with replacement vs. one without replacement. You want the latter.

Imagine you have a box with many uniquely numbered balls in it. What you are doing is choosing a ball at random, but after writing down what you have chosen, putting the ball back into the box before choosing again. This allows for the possibility of making a repeated choice. However, what you want is to eliminate that possibility once that ball has been chosen. In order to do that you will have to change the mechanism you have used for making the random choice based on your randomly generated number. A good example can be found here.

Community
  • 1
  • 1
john
  • 3,043
  • 5
  • 27
  • 48
1

Try using arc4random() from stdlib.h. It has a better pseudo-random number generation algorithm than rand(), and it doesn't require you to set an initial seed.

1

A proper set of random number generators is implemented in the gnu gsl library. You can pick from a quite variety of well tested random number generators. For serious computations, do not use rand(). In your case I would use a gsl_ran_sample from the given input set. This would look like this:

#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

#define N 455
#define K 160

int main(int argc, char **argv)
{
    double arr[N];
    double randarr[K];
    gsl_rng *r = NULL;
    const gsl_rng_type *T;
    int seed = 31456;   // intial seed of the random number generator 
    int i;

//        gsl_rng_env_setup(); // if you want to set different random number generators etc.. or set external seeds
    T = gsl_rng_ranlxs2;
    r = gsl_rng_alloc(T);
    gsl_rng_set(r, seed);

    for (i = 0; i < N; i++) {
        arr[i] = i;
    }

//      gsl_ran_choose(r, randarr, K, arr, N, sizeof(double)); // without replacement
//      in case of choose: you will need to call gsl_ran_shuffle(r, randarr, K, sizeof(double)) if you want to randomize the order.  

    gsl_ran_sample(r, randarr, K, arr, N, sizeof(double));  // with replacement
    fprintf(stdout, "Picked array elements:\n");

    for (i = 0; i < K; i++) {
        fprintf(stdout, "%f\n", randarr[i]);
    }
    gsl_rng_free(r);
    return 0;
}

if you have a properly installed gsl. compile with

gcc -Wall main.c  `gsl-config --cflags --libs`
Bort
  • 2,423
  • 14
  • 22
0

Create an array of 455 integers from 0 to 454, randomly shuffle it, then use its first 160 numbers as indices into the array of the original 455 items. That'll guarantee uniqueness of the selection.

I should add a little pic from Dilbert explaining probability in a funny way.

There's also a good article on statistical randomness on Wikipedia.

Alexey Frunze
  • 61,140
  • 12
  • 83
  • 180
  • What's the point? if the random numbers generator is not good, you'll get bad shuffling anyway. The underlying dependency on the random numbers generator is still there. – littleadv Nov 28 '11 at 05:38
  • if you just shuffle it randomly itd be impossible to get duplicate numbers - key here is you dont modify the array of 455 ints and keep them as 0-454 – DanZimm Nov 28 '11 at 05:40
  • note: I am not sure what a good algo for shuffling would be , i would ask the person who made this answer – DanZimm Nov 28 '11 at 05:40
  • @Dan, I understand why it provides for uniqueness, I just don't understand how it answers the question asked. – littleadv Nov 28 '11 at 05:42
  • Any standard library random number generator is very likely good enough (for casual use, which is what this sounds like.) Generating a random permutation and taking the leading elements from it is a typical way of getting a random subset with a determined size. – phs Nov 28 '11 at 05:44
  • Actually my system does not need totally 160 unique numbers. It can and should tolerate few similar items – Lipika Deka Nov 28 '11 at 05:45
  • @littleadv in randomly shuffling youll get a list of random ordered numbers, but i may be mistaken and the OP just wants a list of random numbers that can be duplicates just not 10 dupes. but in allowing there only to be 0-454 youll get no dupes whatsoever when shuffling as you wont add or remove any numbers just shuffle the order – DanZimm Nov 28 '11 at 05:47
  • You can always throw in a few dups "at random". – Alexey Frunze Nov 28 '11 at 05:57
0

Well, without seeing your code its hard to tell what's wrong, but you can always use the /dev/random device to draw random numbers. Just open it as a file and read from it.

However, you still might get repetitions. Just filter them out.

littleadv
  • 20,100
  • 2
  • 36
  • 50
0

While I'm sure there are other pseudo-random number generators you could use, its besides the point. Your expectations are unrealistic. Using a random number generator that I have reason to trust*, I find that you need to pick fewer than 100 items from a set of 455 in order to expect fewer than 10 items to be duplicates. If uniqueness is important, its fairly easy to implement; if randomness is important, then you're fine as is. I can just about guarantee that your random number distribution is just fine.

*I used various transformations on the raw random number to pick from a set. If there was a problem with the generator, the different transformations would have shown different biases. They did not, so I'm satisfied that the generator is close enough to random for the problem at hand.

Aaron Dufour
  • 17,288
  • 1
  • 47
  • 69