Pick N distinct items at random from sequence of unknown length, in only one iteration

Question

I am trying to write an algorithm that would pick N distinct items from an sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.

I have found a solution when N=1 (that is, "pick exactly one element at random from a huge sequence"):

import random
items = range(1, 10) # Imagine this is a huge sequence of unknown length
count = 1
selected = None
for item in items:
    if random.random() * count < 1:
        selected = item
    count += 1

But how can I achieve the same thing for other values of N (say, N=3)?

Not an answer to the question asked, but note that for built-in collections (and many others) you can just do [`random.sample(your_collection, N)`](https://docs.python.org/2/library/random.html#random.sample). — Mark Amery, Nov 17 '15 at 16:51
You say *"without knowing the size of the sequence in advance"* but then your code example shows you using an upper-bound `range(1, 10)` . Is this really an XY question for asking "How to determine/estimate upper-bound of iterator length (without iterating)?". For example, if it was a text file, we simply get(/estimate) the file size, then divide by an estimated average/max/min line length (in characters). (and for Unicode, estimate character length in bytes) — smci, Oct 07 '19 at 23:11
**As of 3.6/PEP 424, objects can now optionally have a `__length_hint__()`** [Can I speedup an iterable class when I know it's length in advance?](https://stackoverflow.com/questions/41582041/can-i-speedup-an-iterable-class-when-i-know-its-length-in-advance). And also, it's generally not necessary to call your entire file into memory to estimate its line-length/record-length/whatever. So, what type of data are you handling, and how do we efficiently estimate (upper-bound) for its length? — smci, Oct 07 '19 at 23:45

score 86 · Answer 1 · edited Jul 01 '15 at 23:04

86

If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle:

import random
arr=[1,2,3,4]

# In-place shuffle
random.shuffle(arr)

# Take the first 2 elements of the now randomized array
print arr[0:2]
[1, 3]

Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence) on it, but this will work regardless of the types of the objects in your sequence.

Naturally, if you can't fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.

edited Jul 01 '15 at 23:04

Mark Amery

143,130
81
406
459

answered Jun 23 '14 at 12:45

Carl Bellingan

1,460
12
7

5

The array's size is *unknown* or *not possible to know*, and it might be huge. For example, randomly choosing n elements from a 100G stream. – Jackson Tale Apr 03 '15 at 15:22
Elegant solution – Mehmet Burak Sayıcı Nov 05 '20 at 09:27

score 51 · Accepted Answer · edited May 23 '17 at 12:10

51

Use reservoir sampling. It's a very simple algorithm that works for any N.

Here is one Python implementation, and here is another.

edited May 23 '17 at 12:10

Community

1
1

answered Mar 13 '12 at 18:40

NPE

486,780
108
951
1,012

Solomon Vimal · Answer 3 · 2021-06-20T00:48:06.270

47

Simplest I've found is this answer in SO, improved a bit below:

import random

my_list = [1, 2, 3, 4, 5]
how_big = 2

new_list = random.sample(my_list, how_big)

# To preserve the order of the list, you could do:
randIndex = random.sample(range(len(my_list)), how_big)
randIndex.sort()
new_list = [my_list[i] for i in randIndex]

edited Jun 20 '21 at 00:48

answered Apr 05 '18 at 23:05

Solomon Vimal

920
12
27

6

This should be the top answer I guess – matanster Jul 11 '18 at 14:37
1

This should be the selected answer – Chau Pham Sep 04 '20 at 19:48

Christof Henkel · Answer 4 · 2018-04-12T12:59:41.033

19

If you have python version of 3.6+ you can use choices

from random import choices

items = range(1, 10)
new_items = choices(items, k = 3)

print(new_items) 
[6, 3, 1]

edited Apr 12 '18 at 12:59

answered Jan 25 '18 at 05:41

Christof Henkel

371
2
9

1

Great answer, but only available in 3.6+. – Dan Mar 08 '18 at 18:09
5

But this may have repetition! – Pawan Mishra Feb 05 '19 at 15:02
2

Unlike the other answers, this also works if `k > len(items)`. Just what I needed, thanks! – Nick K9 Feb 25 '20 at 20:39

JesseBuesking · Answer 5 · 2016-04-04T18:14:30.843

4

@NPE is correct, but the implementations that are being linked to are sub-optimal and not very "pythonic". Here's a better implementation:

def sample(iterator, k):
    """
    Samples k elements from an iterable object.

    :param iterator: an object that is iterable
    :param k: the number of items to sample
    """
    # fill the reservoir to start
    result = [next(iterator) for _ in range(k)]

    n = k - 1
    for item in iterator:
        n += 1
        s = random.randint(0, n)
        if s < k:
            result[s] = item

    return result

Edit As @panda-34 pointed out the original version was flawed, but not because I was using randint vs randrange. The issue is that my initial value for n didn't account for the fact that randint is inclusive on both ends of the range. Taking this into account fixes the issue. (Note: you could also use randrange since it's inclusive on the minimum value and exclusive on the maximum value.)

edited Apr 04 '16 at 18:14

answered May 24 '14 at 01:06

JesseBuesking

6,496
4
44
89

1

A quick check of `Counter(itertools.chain.from_iterable(sample(iter(range(100)), 5) for x in range(100000))) ` shows heavy and consistent bias towards the beginning of the range – panda-34 Apr 03 '16 at 14:59
And the culprit is using of `randint` instead of `randrange` – panda-34 Apr 03 '16 at 15:09
1

@panda-34 thanks for the heads up! I updated the answer based on your comments to address the issue. – JesseBuesking Apr 04 '16 at 18:25
Using `randrange` here would be neater than explicitly subtracting 1 from the end of the range just to make `randint` behave like `randrange`. – Mark Amery Sep 21 '16 at 13:53

score 4 · Answer 6 · answered Oct 22 '16 at 21:20

4

Following will give you N random items from an array X

import random
list(map(lambda _: random.choice(X), range(N)))

answered Oct 22 '16 at 21:20

Shubham Chaudhary

47,722
9
78
80

2

This will not give distinct elements: >>> x = ["a", "b", "c", "d", "e", "f", "g", "h", "i"] >>> list(map(lambda _: random.choice(x), range(3))) ['c', 'a', 'a'] – Gábor Nagy Jan 19 '17 at 16:27

score 3 · Answer 7 · answered Mar 13 '12 at 18:50

It should be enough to accept or reject each new item just once, and, if you accept it, throw out a randomly chosen old item.

Suppose you have selected N items of K at random and you see a (K+1)th item. Accept it with probability N/(K+1) and its probabilities are OK. The current items got in with probability N/K, and get thrown out with probability (N/(K+1))(1/N) = 1/(K+1) so survive through with probability (N/K)(K/(K+1)) = N/(K+1) so their probabilities are OK too.

And yes I see somebody has pointed you to reservoir sampling - this is one explanation of how that works.

score 2 · Answer 8 · answered Mar 13 '12 at 18:47

2

As aix mentioned reservoir sampling works. Another option is generate a random number for every number you see and select the top k numbers.

To do it iteratively, maintain a heap of k (random number, number) pairs and whenever you see a new number insert to the heap if it is greater than smallest value in the heap.

answered Mar 13 '12 at 18:47

ElKamina

7,747
28
43

I like this - it's trivial to see that it works, since you're just generating a random number for every entry in the sequence and taking the top k. Reservoir sampling, on the other hand, looks at first glance like it *probably* works but it takes a little thought and calculation to prove it. – Mark Amery Jul 01 '15 at 23:18

score 0 · Answer 9 · answered Jan 22 '17 at 01:31

This was my answer to a duplicate question (closed before I could post) that was somewhat related ("generating random numbers without any duplicates"). Since, it is a different approach than the other answers, I'll leave it here in case it provides additional insight.

from random import randint

random_nums = []
N = # whatever number of random numbers you want
r = # lower bound of number range
R = # upper bound of number range

x = 0

while x < N:
    random_num = randint(r, R) # inclusive range
    if random_num in random_nums:
        continue
    else:
        random_nums.append(random_num)
        x += 1

The reason for the while loop over the for loop is that it allows for easier implementation of non-skipping in random generation (i.e. if you get 3 duplicates, you won't get N-3 numbers).

learner · Answer 10 · 2019-12-30T17:53:19.170

There's one implementation from the numpy library.

Assuming that N is smaller than the length of the array, you'd have to do the following:

# my_array is the array to be sampled from
assert N <= len(my_array)
indices = np.random.permutation(N)  # Generates shuffled indices from 0 to N-1
sampled_array = my_array[indices]

If you need to sample the whole array and not just the first N positions, then you can use:

import random
sampled_array = my_array[random.sample(len(my_array), N)]

Pick N distinct items at random from sequence of unknown length, in only one iteration

10 Answers10

Linked

Related