10

I have an array, A = [a1,a2,a3,...aP] with size P. I have to sample q elements from array A.

I plan to use a loop with q iterations, and randomly pick a element from A at each iteration. But how can I make sure that the picked number will be different at each iteration?

user4035
  • 22,508
  • 11
  • 59
  • 94
bit-question
  • 3,733
  • 17
  • 50
  • 63
  • For an approach faster than shuffling, search for implementations of random sampling without replacement (I remember something from the Python Cookbook, for example). Also see Donald Knuth's *The Art of Computer Programming*, section 3.4.2. – FMc Jan 22 '12 at 18:35

4 Answers4

19

The other answers all involve shuffling the array, which is O(n). It means modifying the original array (destructive) or copying the original array (memory intensive).

The first way to make it more memory efficient is not to shuffle the original array but to shuffle an array of indexes.

# Shuffled list of indexes into @deck
my @shuffled_indexes = shuffle(0..$#deck);

# Get just N of them.
my @pick_indexes = @shuffled_indexes[ 0 .. $num_picks - 1 ];  

# Pick cards from @deck
my @picks = @deck[ @pick_indexes ];

It is at least independent of the content of the @deck, but its still O(nlogn) performance and O(n) memory.

A more efficient algorithm (not necessarily faster, depends on now big your array is) is to look at each element of the array and decide if it's going to make it into the array. This is similar to how you select a random line from a file without reading the whole file into memory, each line has a 1/N chance of being picked where N is the line number. So the first line has a 1/1 chance (it's always picked). The next has a 1/2. Then 1/3 and so on. Each pick will overwrite the previous pick. This results in each line having a 1/total_lines chance.

You can work it out for yourself. A one line file has a 1/1 chance so the first one is always picked. A two line file... the first line has a 1/1 then a 1/2 chance of surviving, which is 1/2, and the second line has a 1/2 chance. For a three line file... the first line has a 1/1 chance of being picked, then a 1/2 * 2/3 chance of surviving which is 2/6 or 1/3. And so on.

The algorithm is O(n) for speed, it iterates through an unordered array once, and does not consume any more memory than is needed to store the picks.

With a little modification, this works for multiple picks. Instead of a 1/$position chance, it's $picks_left / $position. Each time a pick is successful, you decrement $picks_left. You work from the high position to the low one. Unlike before, you don't overwrite.

my $picks_left = $picks;
my $num_left = @$deck;
my @picks;
my $idx = 0;
while($picks_left > 0 ) {  # when we have all our picks, stop
    # random number from 0..$num_left-1
    my $rand = int(rand($num_left));

    # pick successful
    if( $rand < $picks_left ) {
        push @picks, $deck->[$idx];
        $picks_left--;
    }

    $num_left--;
    $idx++;
}

This is how perl5i implements its pick method (coming next release).

To understand viscerally why this works, take the example of picking 2 from a 4 element list. Each should have a 1/2 chance of being picked.

1. (2 picks, 4 items):         2/4 = 1/2

Simple enough. Next element has a 1/2 chance that an element will already have been picked, in which case it's chances are 1/3. Otherwise its chances are 2/3. Doing the math...

2. (1 or 2 picks,  3 items):   (1/3 * 1/2) + (2/3 * 1/2) = 3/6 = 1/2

Next has a 1/4 chance that both elements will already be picked (1/2 * 1/2), then it has no chance; 1/2 chance that only one will be picked, then it has 1/2; and the remaining 1/4 that no items will be picked in which case it's 2/2.

3. (0, 1 or 2 picks, 2 items): (0/2 * 1/4) + (1/2 * 2/4) + (2/2 * 1/4) = 2/8 + 1/4 = 1/2

Finally, for the last item, there's a 1/2 the previous took the last pick.

4. (0 or 1 pick, 1 items):     (0/1 * 2/4) + (1/1 * 2/4) = 1/2

Not exactly a proof, but good for convincing yourself it works.

nooj
  • 578
  • 5
  • 13
Schwern
  • 153,029
  • 25
  • 195
  • 336
  • `List::Gen` has different design goals than `perl5i`, including the ability to work with infinite ranges of numbers. It does not copy and shuffle the whole array in order to pick elements, that is completely wrong. If it did, then selecting from an infinite source such as `<1..*>->pick(5)->say` couldn't work (but it does). – Eric Strom Jan 22 '12 at 23:30
  • @EricStrom Thanks for the clarification about it being all about laziness. Didn't mean to slag on List::Gen, but I felt the performance problem was so acute to warn folks away from it unless they needed the lazy aspect. – Schwern Jan 22 '12 at 23:52
  • Yet you didn't edit your answer... Also, are you aware that your `perl5i` version of `pick` is **sorted**? (Well at least until you ask for all the elements, where it falls back to `List::Util::shuffle` and works properly) Hopefully that will be fixed before the next release. – Eric Strom Jan 23 '12 at 05:45
  • @EricStrom I saw `my $pick = $self->shuffle->take($n);` and continued to presume that it would have to shuffle `$self` before taking `$n`. Your shuffle algorithm can only shuffle the first $n elements? That's pretty neat. I'll remove the List::Gen reference entirely, it's not terribly relevant. Also, thanks for pointing out the ordering problem. It wasn't a consideration, but I see your point. – Schwern Jan 24 '12 at 00:28
  • shuffling is linear, not liearithmic. – sds Feb 12 '13 at 21:09
  • @sds By jove, you're right! I've switched it to community wiki, so feel free to fix up the answer. – Schwern Feb 15 '13 at 03:56
8

From perldoc perlfaq4:

How do I shuffle an array randomly?

If you either have Perl 5.8.0 or later installed, or if you have Scalar-List-Utils 1.03 or later installed, you can say:

use List::Util 'shuffle';
@shuffled = shuffle(@list);

If not, you can use a Fisher-Yates shuffle.

sub fisher_yates_shuffle {

    my $deck = shift;  # $deck is a reference to an array
    return unless @$deck; # must not be empty!

    my $i = @$deck;
    while (--$i) {
        my $j = int rand ($i+1);
        @$deck[$i,$j] = @$deck[$j,$i];
    }
}


# shuffle my mpeg collection
# 

my @mpeg = <audio/*/*.mp3>;
fisher_yates_shuffle( \@mpeg );    # randomize @mpeg in place
print @mpeg;

You could also use List::Gen:

my $gen = <1..10>;
print "$_\n" for $gen->pick(5);  # prints five random numbers
Zaid
  • 36,680
  • 16
  • 86
  • 155
  • 1
    List::Gen is extremely slow, poking along at a few thousand picks a second, due to its speed inefficient lazy shuffle. List::Util::shuffle is three orders of magnitude faster, a few hundred thousand picks a second. I've notified the List::Gen author. – Schwern Jan 22 '12 at 20:21
  • @Schwern => the point of List::Gen's `->pick` is that it is lazy. This allows for picking random elements from very large, or changing, or slow to access, or infinite data sources. This inevitably comes at a performance cost. I will consider using an eager algorithm that can be more heavily optimized when `->pick($n)` is called in list context, since that usage requires all of the elements to be calculated at once. – Eric Strom Jan 22 '12 at 23:41
4

You can suse the Fisher-Yates shuffle algorithm to randomly permute your array and then use a slice of the first q elements. Here's code from PerlMonks:

# randomly permutate @array in place
sub fisher_yates_shuffle
{
    my $array = shift;
    my $i = @$array;
    while ( --$i )
    {
        my $j = int rand( $i+1 );
        @$array[$i,$j] = @$array[$j,$i];
    }
}

fisher_yates_shuffle( \@array );    # permutes @array in place

You can probably optimize this by having the shuffle stop after it has q random elements selected. (The way this is written, you'd want the last q elements.)

Ted Hopp
  • 232,168
  • 48
  • 399
  • 521
  • This algorithm is available as [`List::Util`](http://search.cpan.org/perldoc?List::Util)`::shuffle` – ikegami Jan 22 '12 at 19:38
  • @ikegami - Indeed. However, the optimization I mentioned isn't available if you use `List::Util::shuffle`; if P is very large and q is much smaller, this might be a factor. – Ted Hopp Jan 22 '12 at 21:34
-1

You may construct second array, boolean with size P and store true for picked numbers. And when the numer is picked, check second table; in case "true" you must pick next one.

theWalker
  • 2,022
  • 2
  • 18
  • 27
  • 1
    That will be VERY slow if "q" is close to "P", especially if the two numbers are large. If q > P, it will go into an infinite loop. The standard algorithm for solving this problem is described in my answer. – Alex D Jan 22 '12 at 18:03