50

I am trying to write an algorithm that would pick N distinct items from an sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.

I have found a solution when N=1 (that is, "pick exactly one element at random from a huge sequence"):

import random
items = range(1, 10) # Imagine this is a huge sequence of unknown length
count = 1
selected = None
for item in items:
    if random.random() * count < 1:
        selected = item
    count += 1

But how can I achieve the same thing for other values of N (say, N=3)?

smci
  • 32,567
  • 20
  • 113
  • 146
akonsu
  • 28,824
  • 33
  • 119
  • 194
  • 6
    Not an answer to the question asked, but note that for built-in collections (and many others) you can just do [`random.sample(your_collection, N)`](https://docs.python.org/2/library/random.html#random.sample). – Mark Amery Nov 17 '15 at 16:51
  • You say *"without knowing the size of the sequence in advance"* but then your code example shows you using an upper-bound `range(1, 10)` . Is this really an XY question for asking "How to determine/estimate upper-bound of iterator length (without iterating)?". For example, if it was a text file, we simply get(/estimate) the file size, then divide by an estimated average/max/min line length (in characters). (and for Unicode, estimate character length in bytes) – smci Oct 07 '19 at 23:11
  • **As of 3.6/PEP 424, objects can now optionally have a `__length_hint__()`** [Can I speedup an iterable class when I know it's length in advance?](https://stackoverflow.com/questions/41582041/can-i-speedup-an-iterable-class-when-i-know-its-length-in-advance). And also, it's generally not necessary to call your entire file into memory to estimate its line-length/record-length/whatever. So, what type of data are you handling, and how do we efficiently estimate (upper-bound) for its length? – smci Oct 07 '19 at 23:45

10 Answers10

86

If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle:

import random
arr=[1,2,3,4]

# In-place shuffle
random.shuffle(arr)

# Take the first 2 elements of the now randomized array
print arr[0:2]
[1, 3]

Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence) on it, but this will work regardless of the types of the objects in your sequence.

Naturally, if you can't fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.

Mark Amery
  • 143,130
  • 81
  • 406
  • 459
Carl Bellingan
  • 1,460
  • 12
  • 7
51

Use reservoir sampling. It's a very simple algorithm that works for any N.

Here is one Python implementation, and here is another.

Community
  • 1
  • 1
NPE
  • 486,780
  • 108
  • 951
  • 1,012
47

Simplest I've found is this answer in SO, improved a bit below:

import random

my_list = [1, 2, 3, 4, 5]
how_big = 2

new_list = random.sample(my_list, how_big)

# To preserve the order of the list, you could do:
randIndex = random.sample(range(len(my_list)), how_big)
randIndex.sort()
new_list = [my_list[i] for i in randIndex]
Solomon Vimal
  • 920
  • 12
  • 27
19

If you have python version of 3.6+ you can use choices

from random import choices

items = range(1, 10)
new_items = choices(items, k = 3)

print(new_items) 
[6, 3, 1]
Christof Henkel
  • 371
  • 2
  • 9
4

@NPE is correct, but the implementations that are being linked to are sub-optimal and not very "pythonic". Here's a better implementation:

def sample(iterator, k):
    """
    Samples k elements from an iterable object.

    :param iterator: an object that is iterable
    :param k: the number of items to sample
    """
    # fill the reservoir to start
    result = [next(iterator) for _ in range(k)]

    n = k - 1
    for item in iterator:
        n += 1
        s = random.randint(0, n)
        if s < k:
            result[s] = item

    return result

Edit As @panda-34 pointed out the original version was flawed, but not because I was using randint vs randrange. The issue is that my initial value for n didn't account for the fact that randint is inclusive on both ends of the range. Taking this into account fixes the issue. (Note: you could also use randrange since it's inclusive on the minimum value and exclusive on the maximum value.)

JesseBuesking
  • 6,496
  • 4
  • 44
  • 89
  • 1
    A quick check of `Counter(itertools.chain.from_iterable(sample(iter(range(100)), 5) for x in range(100000))) ` shows heavy and consistent bias towards the beginning of the range – panda-34 Apr 03 '16 at 14:59
  • And the culprit is using of `randint` instead of `randrange` – panda-34 Apr 03 '16 at 15:09
  • 1
    @panda-34 thanks for the heads up! I updated the answer based on your comments to address the issue. – JesseBuesking Apr 04 '16 at 18:25
  • Using `randrange` here would be neater than explicitly subtracting 1 from the end of the range just to make `randint` behave like `randrange`. – Mark Amery Sep 21 '16 at 13:53
4

Following will give you N random items from an array X

import random
list(map(lambda _: random.choice(X), range(N)))
Shubham Chaudhary
  • 47,722
  • 9
  • 78
  • 80
  • 2
    This will not give distinct elements: >>> x = ["a", "b", "c", "d", "e", "f", "g", "h", "i"] >>> list(map(lambda _: random.choice(x), range(3))) ['c', 'a', 'a'] – Gábor Nagy Jan 19 '17 at 16:27
3

It should be enough to accept or reject each new item just once, and, if you accept it, throw out a randomly chosen old item.

Suppose you have selected N items of K at random and you see a (K+1)th item. Accept it with probability N/(K+1) and its probabilities are OK. The current items got in with probability N/K, and get thrown out with probability (N/(K+1))(1/N) = 1/(K+1) so survive through with probability (N/K)(K/(K+1)) = N/(K+1) so their probabilities are OK too.

And yes I see somebody has pointed you to reservoir sampling - this is one explanation of how that works.

mcdowella
  • 19,301
  • 2
  • 19
  • 25
2

As aix mentioned reservoir sampling works. Another option is generate a random number for every number you see and select the top k numbers.

To do it iteratively, maintain a heap of k (random number, number) pairs and whenever you see a new number insert to the heap if it is greater than smallest value in the heap.

ElKamina
  • 7,747
  • 28
  • 43
  • I like this - it's trivial to see that it works, since you're just generating a random number for every entry in the sequence and taking the top k. Reservoir sampling, on the other hand, looks at first glance like it *probably* works but it takes a little thought and calculation to prove it. – Mark Amery Jul 01 '15 at 23:18
0

This was my answer to a duplicate question (closed before I could post) that was somewhat related ("generating random numbers without any duplicates"). Since, it is a different approach than the other answers, I'll leave it here in case it provides additional insight.

from random import randint

random_nums = []
N = # whatever number of random numbers you want
r = # lower bound of number range
R = # upper bound of number range

x = 0

while x < N:
    random_num = randint(r, R) # inclusive range
    if random_num in random_nums:
        continue
    else:
        random_nums.append(random_num)
        x += 1

The reason for the while loop over the for loop is that it allows for easier implementation of non-skipping in random generation (i.e. if you get 3 duplicates, you won't get N-3 numbers).

tooty44
  • 6,829
  • 9
  • 27
  • 39
0

There's one implementation from the numpy library.

Assuming that N is smaller than the length of the array, you'd have to do the following:

# my_array is the array to be sampled from
assert N <= len(my_array)
indices = np.random.permutation(N)  # Generates shuffled indices from 0 to N-1
sampled_array = my_array[indices]

If you need to sample the whole array and not just the first N positions, then you can use:

import random
sampled_array = my_array[random.sample(len(my_array), N)]
learner
  • 3,168
  • 3
  • 18
  • 35