
I have a sorted list, let's say (it's not really just numbers; it's a list of objects sorted with a complicated, time-consuming algorithm):

mylist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Is there some Python function that will give me N of the items, but keep their order?

Example:

randomList = getRandom(mylist, 4)
# randomList = [3, 6, 7, 9]
randomList = getRandom(mylist, 4)
# randomList = [1, 2, 4, 8]

etc...

gsamaras
Yochai Timmer
  • Why don't you just use `random.sample` and then sort? – Daniel Lubarov Jun 26 '11 at 08:15
  • It's sorted with a non-trivial algorithm... it's not really just numbers – Yochai Timmer Jun 26 '11 at 08:16
  • A very slight change to Daniel's comment: sample a range of `[0, count)`, sort the sample (the numbers in the range have a natural ordering), then extract the values from `mylist` based on the indices. Using `zip` could achieve the same effect with slightly different mechanics. – Jun 26 '11 at 08:18
  • OK, can I get an answer + example so I have something to accept? :) – Yochai Timmer Jun 26 '11 at 08:21

5 Answers


The following code generates a random sample of size 4, preserving the original order:

import random

sample_size = 4
sorted_sample = [
    mylist[i] for i in sorted(random.sample(range(len(mylist)), sample_size))
]

(Note: with Python 2, it's better to use `xrange` instead of `range`.)

Explanation

random.sample(range(len(mylist)), sample_size)

generates a random sample of the indices of the original list.

These indices then get sorted to preserve the ordering of elements in the original list.

Finally, the list comprehension pulls out the actual elements from the original list, given the sampled indices.
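As a quick sanity check (using the question's `mylist`; this demo is mine, not part of the original answer), the sample is always a subset of `mylist` in its original order:

```python
import random

mylist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sample_size = 4

# sample indices, sort them, then index into the original list
sorted_sample = [
    mylist[i] for i in sorted(random.sample(range(len(mylist)), sample_size))
]

print(sorted_sample)  # e.g. [2, 5, 6, 9]: 4 distinct elements, in order
```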

mhyfritz

Simple-to-code O(#picks*log(#picks)) way

Take a random sample without replacement of the indices, sort the indices, and take them from the original.

import random

indices = random.sample(range(len(myList)), K)
sample = [myList[i] for i in sorted(indices)]

random.sample(seq, K) will randomly and simultaneously pick K elements out of the population in seq, without replacement. When we do this with a range, this is O(1) per sample, since a range object in Python is sparse and doesn't actually construct a full list (specifically, the CPython implementation calls len(seq) and later seq[i] on the range object, both of which are virtualized and thus O(1)). You then look up the sampled indices, in order.

If you have an iterator (e.g. a generator expression), consider first converting it to a list and then applying the approach above. If your iterator is of unbounded length, you may use the technique in the next section, which is much less performant but may be intellectually interesting (e.g. if you are working with small bounded lists in a functional language that doesn't yet support indexing, or with giant streams that exceed RAM and disk size):

(Also a useful note from user tegan in the comments: If this is python2, you will want to use xrange, as usual. Otherwise you will have an O(N) rather than an O(#picks) algorithm.)
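For instance, a small helper along those lines (the names here are illustrative, not from the answer) that materializes an iterator and then samples sorted indices:

```python
import random

def ordered_sample(iterable, k, rng=random):
    # materialize the iterator so it can be indexed
    pool = list(iterable)
    indices = rng.sample(range(len(pool)), k)
    return [pool[i] for i in sorted(indices)]

# works on a generator expression by converting it to a list first
squares = (x * x for x in range(20))
print(ordered_sample(squares, 5))
```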


Slow/streamable O(arrayLen)-time, O(1)-auxiliary-space way for non-indexable sequences

You can alternatively use a math trick and iterate through myList from left to right, picking numbers with a dynamically changing probability (N-numbersPicked)/(total-numbersVisited). This approach is O(N) since it visits everything once (faster than sorting, which is O(N log(N))), though much slower than directly indexing the K picks as in the previous section (which was O(K log(K)) after sorting).

from __future__ import division
import random

def orderedSampleWithoutReplacement(seq, k):
    if not 0 <= k <= len(seq):
        raise ValueError('Required that 0 <= sample_size <= population_size')

    numbersPicked = 0
    for i, number in enumerate(seq):
        prob = (k - numbersPicked) / (len(seq) - i)
        if random.random() < prob:
            yield number
            numbersPicked += 1

Proof: Consider the uniform distribution (without replacement) of picking a subset of k out of a population seq of size len(seq). We can partition at an arbitrary point i into 'left' (0, 1, ..., i-1) and 'right' (i, i+1, ..., len(seq)). Given that we picked numbersPicked from the known left part, the remaining picks must come from the same uniform distribution on the unknown right part, though the parameters are now different. In particular, the probability that seq[i] contains a chosen element is #remainingToChoose/#remainingToChooseFrom, or (k-numbersPicked)/(len(seq)-i), so we simulate that and recurse on the result. (This must terminate, since if #remainingToChoose == #remainingToChooseFrom, then all remaining probabilities are 1.)

This is similar to a probability tree that happens to be dynamically generated. Basically, you can simulate a uniform probability distribution by conditioning on prior choices: as you grow the probability tree, you pick the probability of the current branch such that it is a posteriori the same as for prior leaves, i.e. conditioned on prior choices. This works because each element ends up chosen with probability exactly k/N.

(One may look at the edit history of this post to find an elaborate simulation 'proof', which was previously necessary due to some downvotes.)

Here's another way to code it, with more semantically named variables:

from __future__ import division
import random

def orderedSampleWithoutReplacement(seq, sampleSize):
    totalElems = len(seq)
    if not 0 <= sampleSize <= totalElems:
        raise ValueError('Required that 0 <= sample_size <= population_size')

    picksRemaining = sampleSize
    for elemsSeen, element in enumerate(seq):
        elemsRemaining = totalElems - elemsSeen
        prob = picksRemaining / elemsRemaining
        if random.random() < prob:
            yield element
            picksRemaining -= 1

# sanity check: each of the 6 possible 2-element subsequences of [0, 1, 2, 3]
# should appear roughly equally often (about 1/6 of the time)
from collections import Counter
Counter(
    tuple(orderedSampleWithoutReplacement([0, 1, 2, 3], 2))
    for _ in range(10**5)
)

edit: Timothy Shields mentions Reservoir Sampling, which is sort of like this method (but starts with candidate picks and swaps them out randomly), and useful when len(seq) is unknown (such as with a generator expression). Specifically the one noted as "algorithm R" is O(N) and O(1) aux space if done in-place; it involves taking the first K elements and slowly replacing them (a hint at an inductive proof is also given). There are also useful variants of reservoir sampling to be found on the wikipedia page. The idea is that you prepopulate a list of candidate return values (which we assume fits in RAM or on disk), and probabilistically swap them out as you iterate through the list (which may be arbitrarily bigger than your RAM or disk).
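A minimal order-preserving sketch of algorithm R (my own illustration, not code from the paper): by storing (position, item) pairs, the original input order can be restored at the end, even though the swaps shuffle the reservoir itself:

```python
import random

def reservoir_ordered_sample(iterable, k, rng=random):
    """Algorithm R, storing (position, item) pairs to preserve input order."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # pre-populate the reservoir with the first k items
            reservoir.append((i, item))
        else:
            # item i survives with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = (i, item)
    # sort by original position to restore the input ordering
    return [item for _, item in sorted(reservoir)]

# works on a stream whose length is not known in advance
print(reservoir_ordered_sample(iter(range(1000)), 10))
```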


Optimal O(#picks * (1+log(N/#picks)))-time, O(1)-aux-space way for indexable sequences

One algorithm of note is in the Reservoir Sampling article (ctrl-F for "Algorithm L" in the section "An optimal algorithm"): it is a competitive-factor-optimal algorithm which is (like the original solution) O(k) in the number of samples, not O(n) in the number of elements of the list.

The intuition here is that we can skip arbitrary sections of list without having to even visit them, because the number of elements between picks is not dependent on the data we see in the list.

Instead of relying on the hypergeometric distribution as above, the fact that the reservoir is pre-populated with candidate solutions (the first k items) and periodically swapped out makes it act more like a process with geometric waiting times. It is a well-cited paper, but I cannot access it to tell whether the proof is only asymptotically correct for large N, or works for all N.

It is unclear from the article if this algorithm can be used when the length of the sequence is not known at start time (in which case one could presumably just use the original method in the first section of this answer).
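For what it's worth, here is a sketch of Algorithm L (my own transcription of the Wikipedia pseudocode, adapted to keep (position, item) pairs so the order can be restored; it assumes `seq` is indexable and its length is known):

```python
import math
import random

def ordered_sample_algorithm_L(seq, k, rng=random):
    n = len(seq)
    if not 0 < k <= n:
        raise ValueError('need 0 < k <= len(seq)')
    # pre-populate the reservoir with the first k items
    reservoir = [(i, seq[i]) for i in range(k)]
    # 1.0 - rng.random() lies in (0, 1], which avoids log(0)
    w = math.exp(math.log(1.0 - rng.random()) / k)
    i = k - 1
    while True:
        # skip ahead a geometrically distributed number of elements
        i += int(math.log(1.0 - rng.random()) / math.log(1.0 - w)) + 1
        if i >= n:
            break
        reservoir[rng.randrange(k)] = (i, seq[i])
        w *= math.exp(math.log(1.0 - rng.random()) / k)
    # sort by original position to restore the input ordering
    return [item for _, item in sorted(reservoir)]

print(ordered_sample_algorithm_L(list(range(100)), 5))
```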

ninjagecko
  • @pst: no disadvantage, just a speedup of `O(N)` rather than `O(N log(N))` – ninjagecko Jun 26 '11 at 08:45
  • @pst: Your last claim is incorrect because the probability goes to 1 naturally if samples are not picked up. Please be so kind as to back up your first claim with math, I would be very interested if you can prove I am wrong despite my extensive simulations. – ninjagecko Jun 26 '11 at 09:40
  • Very nice, I was wondering how to do this linear approach too. Does this formula have a wikipedia page? :) – Jochen Ritzel Jun 26 '11 at 12:55
  • @Jochen: thanks! I was wondering that myself, but I couldn't find it, not sure where to add it even, maybe at http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29 ... It might be in probability textbooks though; it's the generalization of the `[1/N,1/N-1,1/N-2,...,1]` sampling method of uniform discrete distributions to multiple values (without replacement). – ninjagecko Jun 26 '11 at 13:29
  • I'm surprised this answer doesn't have more upvotes; it actually explains how the solution works (and provides another solution!), as opposed to the first answer, which is just a one-line snippet giving me no idea why or how it worked. – crazy2be Jun 26 '11 at 16:15
  • Nice solution ninjagecko. There's a nice inductive proof to your solution if anyone is interested in writing it up. – Neil G Jun 27 '11 at 08:55
  • Nice solution! Don't forget to add `from __future__ import division` for those running Python 2. – xApple Jun 04 '13 at 15:14
  • You should name the algorithm in your answer: [Reservoir Sampling](http://en.wikipedia.org/wiki/Reservoir_sampling) – Timothy Shields Jan 28 '15 at 20:10
  • In this situation, you probably want to use `xrange()` not `range()`, especially if your list is long - `range()` puts all the elements in memory, while `xrange()` evaluates lazily (so you won't waste time and memory creating a list you don't need). See [here](http://stackoverflow.com/questions/94935/what-is-the-difference-between-range-and-xrange-functions-in-python-2-x) for more details – tegan Mar 04 '15 at 15:49
  • tegan: Ah yes, sorry, I'm used to coding in python3. It's not the tag the OP posted about (merely python2), but for what it's worth, `range()` is a lazy object in python3. Edited. – ninjagecko Mar 06 '15 at 20:11
  • For those running Python 2.x: `prob = (k-numbersPicked)/float(len(seq)-i)` – Amichai Aug 07 '15 at 15:16
  • @ninjagecko I tried this algorithm and it definitely doesn't work for every sequence. Here is a simple counter-example: http://ideone.com/FNYfj8 – Alex Zhukovskiy Jul 26 '17 at 16:01
  • @AlexZhukovskiy: (re: "I tried this algorithm and it definitely doesn't work for every sequence. Here is a simple counter-example.") An algorithm works if it has a valid mathematical proof like this one; the above test case is also pretty good evidence that it works. I do not know C#, but I notice that your `i` variable is not even being incremented. There may be other bugs in your transcription. – ninjagecko Jul 31 '17 at 10:55
  • @ninjagecko I reread your answer and [here](http://ideone.com/Anbzf1) is a fixed implementation. I agree that it seems to guarantee returning exactly N records. I'm sorry for reading inattentively the first time. – Alex Zhukovskiy Jul 31 '17 at 11:55

Maybe you can just generate the sample of indices and then collect the items from your list.

import random

randIndex = random.sample(range(len(mylist)), sample_size)
randIndex.sort()
rand = [mylist[i] for i in randIndex]
Howard

Apparently `random.sample` was introduced in Python 2.3,

so for versions below that we can use `shuffle` (example for 4 items):

from random import shuffle

myRange = range(0, len(mylist))
shuffle(myRange)
randomList = [mylist[i] for i in sorted(myRange[:4])]
Yochai Timmer

`random.sample` implements it, though note that it does not preserve the order:

>>> random.sample([1, 2, 3, 4, 5],  3)   # Three samples without replacement
[4, 1, 5]
xiao