2

Considering the performance, what's the best way to get random subset from an array?

Say we get an array with 90000 items, I wanna get 10000 random items from it.

One approach I'm thinking about is to get a random index from 0 to array.length and then remove the selected one from the original array by using Array.prototype.splice. Then get the next random item from the rest.

But the splice method will rearrange the index of all the items after the one we just selected and move them forward on step. Doesn't it affect the performance?

Items may duplicates, but what we select should not. Say we've selected index 0, then we should only look up the rest 1~89999.

牛さん
  • 2,884
  • 3
  • 29
  • 43

3 Answers3

3

If you want a subset of the shuffled array, you do not need to shuffle the whole array. You can stop the classic fisher-yates shuffle when you have drawn your 10000 items, leaving the other 80000 indices untouched.

Bergi
  • 630,263
  • 148
  • 957
  • 1,375
1

I would first randomize the whole array then splice of a 10000 items.

How to randomize (shuffle) a JavaScript array? Explains a good way to randomize a array in javascript

Community
  • 1
  • 1
Marco de Jongh
  • 5,270
  • 3
  • 17
  • 29
  • 1
    This sounds good, but if the amount of the items are so large, shuffling the whole array also costs time. – 牛さん May 01 '14 at 10:12
1

A reservoir sampling algorithm can do this.

Here's an attempt at implementing Knuth's "Algorithm S" from TAOCP Volume 2 Section 3.4.2:

function sample(source, size) {
    var chosen = 0,
        srcLen = source.length,
        result = new Array(size);

    for (var seen = 0; chosen < size; seen++) {
        var remainingInput = srcLen - seen,
            remainingOutput = size - chosen;

        if (remainingInput*Math.random() < remainingOutput) {
            result[chosen++] = source[seen];
        }
    }

    return result;
}

Basically it makes one pass over the input array, choosing or skipping items based on a function of a random number, the number of items remaining in the input, and the number of items remaining to be required in the output.

There are three potential problems with this code: 1. I may have mucked it up, 2. Knuth calls for a random number "between zero and one" and I'm not sure if this means the [0, 1) interval JavaScript provides or the fully closed or fully open interval, 3. it's vulnerable to PRNG bias.

The performance characteristics should be very good. It's O(srcLen). Most of the time we finish before going through the entire input. The input is accessed in order, which is a good thing if you are running your code on a computer that has a cache. We don't even waste any time reading or writing elements that don't ultimately end up in the output.

This version doesn't modify the input array. It is possible to write an in-place version, which might save some memory, but it probably wouldn't be much faster.

Samuel Edwin Ward
  • 6,526
  • 3
  • 34
  • 62