
I have the following code to pick $n elements from an array $array in PHP:

shuffle($array);
$result = array_splice($array, 0, $n);

Given a large array but only a few elements (for example 5 out of 10000), this is relatively slow, so I would like to optimize it such that not all elements have to be shuffled. The values must be unique.

I'm looking for the most performant alternative. We can assume that $array has no duplicates and is 0-indexed.

Nikos M.
Fabian Schmengler
    See: http://php.net/manual/en/function.array-rand.php#93834 – Rizier123 Aug 16 '15 at 13:35
  • Read that too, but I'm a bit worried about the performance of `array_flip` on a large array. – Fabian Schmengler Aug 16 '15 at 13:37
  • @FabianSchmengler thanks for the blog post and benchmarking. I think you should edit your question to briefly explain which solution (of the two in controversy) is best used in which situation for future readers. Ohh! and also, put a link to your blog post with all details. The page is already archived on [Internet Archive](https://web.archive.org/web/*/http://www.schmengler-se.de/en/2015/09/efficiently-draw-random-elements-from-large-php-array/#more-1112) – Fr0zenFyr Jun 09 '18 at 10:14

6 Answers

12
$randomArray = [];
while (count($randomArray) < 5) {
  $randomKey = mt_rand(0, count($array)-1);
  $randomArray[$randomKey] = $array[$randomKey];
}

This will provide exactly 5 unique elements, very quickly. The keys of the original array are preserved.

Note: you'd have to make sure $array has 5 or more elements, or add some sort of check to prevent an endless loop.
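A minimal sketch generalizing the snippet above to $n elements, with the size check the note mentions (the $n parameter, the guard, and the function name pick_random are illustrative additions, not part of the original answer):

```php
<?php
// Pick $n distinct random elements from a 0-indexed array.
// The guard prevents the endless loop mentioned in the note above.
function pick_random(array $array, int $n): array
{
    $count = count($array);
    if ($n > $count) {
        throw new InvalidArgumentException("Cannot pick $n elements from an array of $count");
    }
    $picked = [];
    while (count($picked) < $n) {
        $key = mt_rand(0, $count - 1);
        // re-assigning the same key is a no-op, so the picks stay unique
        $picked[$key] = $array[$key];
    }
    return array_values($picked); // re-index from 0 if the gaps matter
}
```

As the comments below discuss, with $n close to count($array) the chance of re-drawing an already-used key grows, so this only pays off when $n is small relative to the array.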

d79
Devon Bessemer
    With `n` approaching the array length I'd worry about this taking a very long time... Is there a quick way to re-index these after you've chosen them? – Paul S. Aug 16 '15 at 13:44
  • 1
    @PaulS. this all depends on the size of the array. If `n` is close to the array length, then shuffle() or another similar solution would work better. – Devon Bessemer Aug 16 '15 at 13:45
  • if efficiency is really an issue, you could also cache the length of `$array` (calculate it outside the `while`) instead of calculating it each time the `mt_rand` function is called. – Marten Koetsier Aug 16 '15 at 13:58
  • this will generate **large gaps** in the output array and not consecutive keys (as in `$n` randomly picked elements), since the output array should be of `$n` size, but sample code generates array with indices from original array, eg `array(0=>$a1, 100=>$a2,..)` – Nikos M. Aug 16 '15 at 14:31
  • plus this is **not unbiased** as the solution using `shuffle` (original OP solution) is – Nikos M. Aug 16 '15 at 14:45
  • @NikosM. could you explain why this would be biased? I don't mind the gaps, PHP handles that well enough with its *"arrays that actually are hash tables"* and I still can apply `array_values()` on the result. And so far this seems to be the fastest solution. – Fabian Schmengler Aug 16 '15 at 16:47
  • @fschmengler, yeah sure, it is biased because some picks have more probability than others while all possible picks should be equi-probable, check the wikipedia entry on fisher-yates shuffle for further info (the problem is here `mt_rand(0, count($array)-1);`) – Nikos M. Aug 16 '15 at 17:31
  • @fschmengler, also i'm not sure changing the random pick issue (mentioned in previous comment) is going to make the proposed answer unbiased, since there are other things as well, for example the final output is not shuffled (this can be fixed in `O(n)` time with another shuffle), but more importantly in the unbiased shuffle the chosen pick is removed (and that makes it unbiased, along with the counting used) – Nikos M. Aug 16 '15 at 18:18
  • But this is not Fisher Yates and I don't see how the bias explained in the article applies here. This is more like the variation mentioned in the original Fisher Yates: "They also suggested the possibility of using a simpler method — picking random numbers from one to N and discarding any duplicates—to generate the first half of the permutation, and only applying the more complex algorithm to the remaining half, where picking a duplicate number would otherwise become frustratingly common." - just that we don't create a complete permutation – Fabian Schmengler Aug 16 '15 at 19:10
  • @NikosM. I'm not sure what you're on about. Using mt_rand() is actually a more random approach to shuffle and array_rand which rely on libc. Output gaps wouldn't exist, if you want the keys in order from 0, add an array_values() call. – Devon Bessemer Aug 16 '15 at 20:33
  • @Devon and fschmengler, do an analysis on the algorithm (for example use the method described in the fisher-yates shuffle or look around SO and CS.SE) and you will see the probabilities of picks are not uniform and thus the algorithm is not unbiased. BUT the original solution which OP wants to avoid due to performance **is unbiased** since PHP's shuffle is an unbiased shuffle (in fact it is a variation on the fisher-yates shuffle) – Nikos M. Aug 17 '15 at 02:50
  • @fschmengler and Devon [How to prove correctness of a shuffle algorithm?](http://cs.stackexchange.com/questions/2152/how-to-prove-correctness-of-a-shuffle-algorithm) and [here](http://cs.stackexchange.com/questions/29475/how-to-fool-the-try-some-test-cases-heuristic-algorithms-that-appear-correct/29647#29647) – Nikos M. Aug 17 '15 at 02:54
  • @Devon, probably the unbiased thing can be fixed (as described in previous comments), except if OP is happy as is. But the proposed solution is not guaranteed to run in `O(n)` time in the worst case (especially if `n` is comparable to `N`), since many conflicts (i.e. duplicates) can appear and slow it down. My proposed answer tried to have all these but has issues with warping (as explained), i think George Reith's answer is best at present. – Nikos M. Aug 17 '15 at 03:13
  • @NikosM. I guarantee my function will be more random and faster than both of your answers. One because mt_rand provides a much better distribution of numbers than rand, so duplicates are far less common, and results will be more random. Secondly because the amount of additional manipulation your function requires will raise the time required to run each loop. The amount of time to come across a duplicate in my function is negligible. – Devon Bessemer Aug 17 '15 at 03:54
  • @Devon, ok, also it seems i was slightly wrong about unbiasedness as it can be fixed, still the worst-case issue is there and depends on use case OP has in mind, it is not an issue (or should not be an issue) of `mt_rand` vs `rand` (my sample is pseudo-code, `rand` is not specific), but in any case we can take this as granted – Nikos M. Aug 17 '15 at 03:58
  • @NikosM. Why take a complex answer that is slower and accomplishes the same thing? Even if my function runs 6 or 7 loops (which would only happen on 0.001% of occasions with a sample size of 10000), it will be faster than your function running 5 loops. I think you were wrong about everything, but you can keep using something that takes longer to accomplish the same task if you want. – Devon Bessemer Aug 17 '15 at 04:08
  • @Devon, hmm i saw this coming, wrong about everything? posted 3 comments (rest are further info) 1. large gaps wrong? NO, 2. not unbiased, wrong? NO (easily fixable though), 3. not `O(n)` performance in worst-case, wrong? NO. Try running some representative tests and see. i commented too much on this issue already – Nikos M. Aug 17 '15 at 04:12
  • @NikosM: There are no large output gaps, this preserves keys, there is a big difference. If you want numerical keys starting from 0, use array_values() as pointed out. 2. It is unbiased because the mt_rand algorithm provides a very random distribution (to 0.5%) compared to libc or other system generators. Compare it yourself if you need to verify this. It is rand() that is biased. 3. O(n) for 99.999% of cases on a 10000 sample size. For duplicates, performance will still be faster. – Devon Bessemer Aug 17 '15 at 04:17
  • @fschmengler, updated my answer with a new algorithm, strictly `O(n)` in both time and space and statistically unbiased, check it out – Nikos M. Aug 17 '15 at 05:16
  • @NikosM. and Devon thanks for the interesting discussion. I ran some benchmarks and also wrote a non-formal proof that this algorithm in fact is **unbiased**. Check it out if you are interested in the results: http://www.schmengler-se.de/en/2015/09/efficiently-draw-random-elements-from-large-php-array/#more-1112 – Fabian Schmengler Sep 03 '15 at 21:58
  • @fschmengler, yeap the unbiased point was not correct (at least fully), since i saw it afterwards (and made additional comment), the unbiased part still holds though since to produce identical result with original `shuffle-splice` solution an additional `shuffle` is needed in final selection (along with `array_values`). Nice article on the issue, if it works for your case it is fine, cheers! – Nikos M. Sep 04 '15 at 11:40
  • Why exactly was this solution downvoted? It does address the situation explained by OP in fastest way. However, I'd recommend @NikosM. solution if you need to get at least `n/2` items from a big array of size `n`. – Fr0zenFyr Jun 09 '18 at 10:07
4

This function performs a shuffle on only $n elements, where $n is the number of random elements you want to pick. It also works on associative arrays and sparse arrays. $array is the array to work on and $n the number of random elements to retrieve.

Define $max_index as count($array) - 1 - $iteration.

It works by generating a random number between 0 and $max_index, picking the key at that index, and then overwriting that index with the key at $max_index so it can never be picked again: $max_index is one less at the next iteration, making that slot unreachable.

In summary, this is Richard Durstenfeld's version of the Fisher-Yates shuffle, but operating only on $n elements instead of the entire array.

function rand_pluck($array, $n) {
    $array_keys = array_keys($array);
    $array_length = count($array_keys);
    $max_index = $array_length - 1;
    $iterations = min($n, $array_length);
    $random_array = array();
    while ($iterations--) {
        $index = mt_rand(0, $max_index);
        $value = $array_keys[$index];
        // overwrite the picked key with the last reachable key,
        // so it can never be picked again
        $array_keys[$index] = $array_keys[$max_index];
        $random_array[] = $array[$value];
        $max_index--;
    }
    return $random_array;
}
George Reith
  • yeap, a variation on the shuffle algorithm is **best** (similar to my answer), both performance-wise and statistically, i.e. **unbiased sampling**, +1 – Nikos M. Aug 16 '15 at 14:48
  • strictly speaking this solution is **not** `O(n)`, but `O(N)`, since `array_keys` has to be used etc., of course it is faster than the original `shuffle` solution and unbiased (as it is a variation on `shuffle`); my solution is strictly `O(n)` but has some other issues.. – Nikos M. Aug 16 '15 at 15:07
  • @NikosM. Indeed but in reality `array_keys` is extremely fast on arrays of massive size (hundreds of thousands of elements). It's important to differentiate time complexity from actual time taken. Although I don't doubt that your method is probably faster without it I decided that the bonus of working on any array was more important than the the 10 milliseconds penalty likely to be incurred per 100k elements. – George Reith Aug 16 '15 at 17:34
  • yeah, it seems we have a tradeoff here; I'm thinking how to optimise my posted answer with another variation, else it seems your answer should be the best solution – Nikos M. Aug 16 '15 at 17:42
3

The trick is to use a variation of shuffle, in other words a partial shuffle.

Performance is not the only criterion; statistical efficiency, i.e. unbiased sampling, is just as important (and the original shuffle solution provides it).

function random_pick( $a, $n ) 
{
  $N = count($a);
  $n = min($n, $N);
  $picked = array_fill(0, $n, 0); $backup = array_fill(0, $n, 0);
  // partially shuffle the array, and generate unbiased selection simultaneously
  // this is a variation on fisher-yates-knuth shuffle
  for ($i=0; $i<$n; $i++) // O(n) times
  { 
    $selected = mt_rand( 0, --$N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
    $value = $a[ $selected ];
    $a[ $selected ] = $a[ $N ];
    $a[ $N ] = $value;
    $backup[ $i ] = $selected;
    $picked[ $i ] = $value;
  }
  // restore partially shuffled input array from backup
  // optional step, if needed it can be ignored, e.g $a is passed by value, hence copied
  for ($i=$n-1; $i>=0; $i--) // O(n) times
  { 
    $selected = $backup[ $i ];
    $value = $a[ $N ];
    $a[ $N ] = $a[ $selected ];
    $a[ $selected ] = $value;
    $N++;
  }
  return $picked;
}

NOTE: the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffle) and produces output which is a proper array with consecutive keys (no extra array_values call needed).

Use example:

$randomly_picked = random_pick($my_array, 5);
// or if an associative array is used
$randomly_picked_keys = random_pick(array_keys($my_array), 5);
$randomly_picked = array_intersect_key($my_array, array_flip($randomly_picked_keys));

For further variations and extensions of shuffling for PHP:

  1. PHP - shuffle only part of an array
  2. PHP shuffle with seed
  3. How can I take n elements at random from a Perl array?
Nikos M.
  • It seems we have posted variations of the same algorithm. +1 for the reasons you mentioned in my answer. – George Reith Aug 16 '15 at 17:41
  • 1
    And like I said below, my algorithm is tons faster. This is about 25x slower and no more random: http://sandbox.onlinephpfunctions.com/code/68a415bbe48b07a9af10bdf8fecf4cf5e616a3ef – Devon Bessemer Aug 17 '15 at 05:57
  • 1
  • @Devon, play around with the test cases and you will be surprised, do this: comment out the optional part of my code (involving backup) and use test cases with values 10, 100, 1000; especially for 10 you will be very surprised, and my code has uniform performance in all these cases **and** is unbiased (produces a true combination) http://sandbox.onlinephpfunctions.com/code/7d63e32356902c1bd49dc1dd91ad44acd2ff38a6 – Nikos M. Aug 17 '15 at 06:05
  • There is nothing more unbiased about yours. mt_rand uses a random distribution that is statistically accurate. You're so concerned about doing n loops, you're not thinking logically about how much processing you're having to do in each loop. Even with 10000 out of 100000 elements, mine is 3x faster. Beyond that is pointless for this question. – Devon Bessemer Aug 17 '15 at 06:07
  • @Devon, you confuse unbiased shuffling with optimal random number generators, it is different, you submited test cases, but you did not play around with them and did not study the overall characteristics – Nikos M. Aug 17 '15 at 06:11
  • generating random numbers and shuffling are the same logical process, if you can't understand that, then I don't know what else can be done. mt_rand shuffles the numbers from min to max and returns one... – Devon Bessemer Aug 17 '15 at 06:13
  • @Devon, ok this is my last comment, no point arguing, we all volunteer here and learn and share, let OP decide what he wants. Your example is fine, but not according to the criteria i set in this answer. Now one can have an optimal random number generator but the algorithm be such that shuffling is not unbiased (previous links i posted display this), hope it is more clear, cheers! – Nikos M. Aug 17 '15 at 06:17
  • Thanks again for your answer. Unfortunately array handling in PHP has a huge overhead, but for PHP 7 and HHVM this is the best general-purpose solution so far (see http://www.schmengler-se.de/en/2015/09/efficiently-draw-random-elements-from-large-php-array/). I accepted @Devons answer though because for my use case it has the best results on all platforms. – Fabian Schmengler Sep 03 '15 at 22:02
  • @fschmengler, ok commented on Devon's answer as well, but note that this algorithm has uniform performance in all cases (which might be slower in some cases due to constant factors, but uniform), because for example there are absolutely no collisions (unlike other cases). Cheers – Nikos M. Sep 04 '15 at 12:02
  • Why would anyone downvote on this solution... it might not be the best solution for OP's problem but it is one of the best in general. Probably the best if you need to get `n` items from an array of size smaller than `2n`... Anyway, I loved the debate you and @Devon put up and observed couple of things closely in the code that I would normally ignore. BTW, my comment may sound **biased** because I was trying to get about `65%` random items from an array of `12600` elements. – Fr0zenFyr Jun 09 '18 at 10:01
2

You could generate a random number with mt_rand() n times and fill these values into a new array. To handle the case where the same index is returned twice, we use the returned index as the key of the new array and always check whether that key already exists; if it does, we loop with while until we get an unused index. At the end we use array_values() to get a 0-indexed array.

$count = count($array) - 1;
$new_array = array();
for ($i = 0; $i < $n; $i++) {
    $index = mt_rand(0, $count);
    // re-draw until we hit an index we have not used yet
    while (isset($new_array[$index])) {
        $index = mt_rand(0, $count);
    }
    $new_array[$index] = $array[$index];
}
$new_array = array_values($new_array);
Charlotte Dunois
2

This will only show benefits for small n compared to an array shuffle, but you could:

  1. Choose a random index r n times, each time decreasing the limit by 1
  2. Adjust for previously used indices
  3. Take value
  4. Store used index

Pseudocode

arr = []
used = []                        # kept sorted ascending
for i = 0..n-1:
    r = rand 0..len-1-i          # random index among the remaining elements
    for j = 0..used.length-1:    # shift r past already-used indices
        if r >= used[j]:
            r += 1
    arr.append($array[r])
    insert r into used, keeping it sorted
return arr
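A runnable PHP translation of the steps above (a sketch; the function name pick_n_random is an assumption, not from the original answer):

```php
<?php
// Pick $n distinct random elements by drawing an index among the
// remaining elements and shifting it past already-used indices,
// so no retry loop is ever needed.
function pick_n_random(array $array, int $n): array
{
    $len = count($array);
    $n = min($n, $len);
    $used = [];   // kept sorted ascending
    $result = [];
    for ($i = 0; $i < $n; $i++) {
        $r = mt_rand(0, $len - 1 - $i); // index among remaining elements
        foreach ($used as $u) {         // adjust for used indices
            if ($r >= $u) {
                $r++;
            }
        }
        $result[] = $array[$r];
        $used[] = $r;
        sort($used);                    // keep the used list sorted
    }
    return $result;
}
```

The inner adjustment loop makes this O(n²) in the number of picks, which is why the answer notes it only helps for small n.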
Paul S.
2

I wonder why everyone here makes it so complicated?

Here's the fastest and simplest way:

$randomArray = array_rand(array_flip($array), $n);
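A usage sketch of this one-liner, with an editorial caveat: it only works when the values are unique and are valid array keys (integers or strings), array_rand returns the picks in original array order rather than shuffled, and with $n == 1 it returns a single key instead of an array:

```php
<?php
$array = range(100, 199); // values must be unique and valid array keys
$n = 5;

// array_flip turns the values into keys; array_rand then picks $n of
// those keys, i.e. $n distinct original values. The (array) cast
// covers the $n == 1 case, where array_rand returns a bare key.
$randomValues = (array) array_rand(array_flip($array), $n);
```

Note also the performance concern raised in the question comments: array_flip on a very large array costs O(N), so this may not beat the other answers for huge inputs.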
hekag71341