4

I have a long python generator that I want to "thin out" by randomly selecting a subset of values. Unfortunately, random.sample() will not work with arbitrary iterables. Apparently, it needs something that supports the len() operation (and perhaps non-sequential access to the sequence, but that's not clear). And I don't want to build an enormous list just so I can thin it out.

As a matter of fact, it is possible to sample from a sequence uniformly in one pass, without knowing its length: there's a nice algorithm in Programming Perl that does just that (edit: "reservoir sampling", thanks @user2357112!). But does anyone know of a standard Python module that provides this functionality?

Demo of the problem (Python 3):

>>> import itertools, random
>>> random.sample(iter("abcd"), 2)
...
TypeError: Population must be a sequence or set.  For dicts, use list(d).

On Python 2, the error is more transparent:

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    random.sample(iter("abcd"), 2)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/random.py", line 321, in sample
    n = len(population)
TypeError: object of type 'iterator' has no len()

If there's no alternative to random.sample(), I'd try my luck with wrapping the generator into an object that provides a __len__ method (I can find out the length in advance). So I'll accept an answer that shows how to do that cleanly.
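
For concreteness, here's a rough sketch of the kind of wrapper I mean (Python 3; the class name is made up). Note that `__len__` alone isn't enough: `random.sample` also indexes into the population, so this sketch caches items lazily as they're requested, which can end up holding a large prefix of the stream in memory:

import random
from collections.abc import Sequence

class CachingSequence(Sequence):  # hypothetical wrapper
    """Wrap an iterator whose length is known in advance."""
    def __init__(self, it, length):
        self._it = iter(it)
        self._len = length
        self._cache = []

    def __len__(self):
        return self._len

    def __getitem__(self, index):
        if index < 0 or index >= self._len:
            raise IndexError(index)
        # Lazily consume and cache the underlying iterator up to `index`.
        while len(self._cache) <= index:
            self._cache.append(next(self._it))
        return self._cache[index]

random.sample(CachingSequence(iter("abcd"), 4), 2)  # e.g. ['c', 'a']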

jfs
  • 399,953
  • 195
  • 994
  • 1,670
alexis
  • 48,685
  • 16
  • 101
  • 161
  • 2
    Are you looking for reservoir sampling? This doesn't come with Python, probably because it only makes sense for crazy huge streams. Also, `__len__` won't be enough; `random.sample` needs random access. – user2357112 Feb 26 '16 at 17:35
  • 4
    To be clear, `random.sample` *does* work with arbitrary sequences, but not with arbitrary iterables. See https://docs.python.org/2/glossary.html – Robᵩ Feb 26 '16 at 17:36
  • As you noted, you could wrap the generator in an object that provides a `__len__` method. But as to the implementation details of that, it would help to know what exactly your generator is doing/ how it is implemented. – ubadub Feb 26 '16 at 17:38
  • 2
    If you know the `len()` *a priori*, then you can do: `indices = random.sample(xrange(len),k)`, and then run your generator until you've extracted each indexed datum. – Robᵩ Feb 26 '16 at 17:40
  • @user2357112, indeed, "reservoir sampling" is what I meant! I wasn't recalling the term, thanks. – alexis Feb 27 '16 at 13:28
  • @ubadub, at the moment I had my hands on `itertools.permutations()`. But the question is intended to be general: A generator of known, large length, not supporting random access, which we want to iterate over only once. – alexis Feb 27 '16 at 13:28
  • @Robᵩ, thanks 2x! Yeah, I'm hazy on the official meaning of `sequence`, `iterable`, `iterator` etc. Fixed. – alexis Feb 27 '16 at 14:15
  • 1
    Here's [python code example on how to select `k` random items from an iterator using reservoir-sampling algorithm (O(n) Algorithm R)](http://stackoverflow.com/a/32792504/4279) – jfs Feb 27 '16 at 14:39
  • @alexis: If you want to randomly sample from the very large set of possible permutations of a sequence, your best bet is to use `random.shuffle` and reshuffle if you get a repeat selection. It'll be way faster than reservoir sampling. This is a case of the general principle that if you can draw random elements of a stream without iterating over the stream, you probably shouldn't reservoir sample. – user2357112 Feb 27 '16 at 17:42
  • @user2357112, I'm afraid I don't follow at all. Or maybe you misunderstand me: `itertools.permutations()` is the function whose output I want to sample. The permuted set itself is pretty small -- in the double digits, though as large as I can practically get it. How do I use `shuffle` under these conditions? (Or did you really mean `random.sample()`?) – alexis Feb 29 '16 at 11:10
  • @alexis: `permutation = l[:]; random.shuffle(permutation)`, or `permutation = random.sample(l, n)` if you're using the two-argument form of `itertools.permutations`. The permuted set may be small, but the set of permutations is going to be huge. – user2357112 Feb 29 '16 at 15:55
  • Oh I see, you mean generating the permutations in this way instead of sampling from the iterable. Got it, thanks. – alexis Feb 29 '16 at 22:27
  • Possible duplicate of [Python random.sample with a generator](http://stackoverflow.com/questions/12581437/python-random-sample-with-a-generator) – Matti Lyra Oct 25 '16 at 09:48

5 Answers

8

Since you know the length of the data returned by your iterable, you can use xrange() to quickly generate indices into it. Then you can just run the iterable until you've grabbed all of the data:

import random

def sample(it, length, k):
    # Choose k distinct stream positions up front.
    indices = random.sample(xrange(length), k)
    result = [None]*k
    for index, datum in enumerate(it):
        if index in indices:
            # Place the datum in the slot matching its chosen position.
            result[indices.index(index)] = datum
    return result

print sample(iter("abcd"), 4, 2)

Alternatively, here is an implementation of reservoir sampling using "Algorithm R":

import random

def R(it, k):
    '''https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R'''
    it = iter(it)
    result = []
    for i, datum in enumerate(it, start=1):
        if i <= k:
            # Fill the reservoir with the first k items.
            result.append(datum)
        else:
            # Keep the i-th item with probability k/i.
            j = random.randint(0, i-1)
            if j < k:
                result[j] = datum
    return result

print R(iter("abcd"), 2)

Note that algorithm R doesn't provide a random order for the results. In the example given, 'b' will never precede 'a' in the results.
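
If the order of the sample matters, one simple remedy (not part of the original answer) is to shuffle the reservoir afterwards:

result = R(iter("abcd"), 2)
random.shuffle(result)  # now 'b' may precede 'a'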

Robᵩ
  • 163,533
  • 20
  • 239
  • 308
  • That's a great idea, thanks! Somehow I'd established that `random.sample` works with `(x)range`, but it didn't occur to me to sample on the indices. But `indices` should be a `dict` in your solution-- otherwise it'll be ridiculously slow. And then, since I don't care about order, you can just return `[datum for n, datum in enumerate(it) if n in indices]`. – alexis Feb 27 '16 at 13:02
  • I also considered sorting the indices and simply advancing the iterator to the next selected index (hence avoiding tons of dict lookups), but I doubt it's worth the trouble to code unless the input is truly gigantic. – alexis Feb 27 '16 at 13:06
  • 1
    1- Your first code example uses `O(n*k)` (quadratic) algorithm unnecessarily (`index in indices` and `indices.index(index)` are `O(k)` operations). 2- you could use `randrange(i)` instead of `randint(0, i-1)` (same result) 3- if the input has less than `k` items then the order produced by your code is not random (though, perhaps it doesn't matter in this case). You might want to [shuffle the input in this case](http://stackoverflow.com/a/35671225/4279) – jfs Feb 27 '16 at 15:02
  • @J.F.Sebastian (and Rob), the `O(n*k)` problem is why I said `indices` should be a `dict`, not a list. That would take care of it. – alexis Feb 29 '16 at 11:29
4

Use the O(n) Algorithm R (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R) to select k random elements from an iterable:

import itertools
import random

def reservoir_sample(iterable, k):
    it = iter(iterable)
    if not (k > 0):
        raise ValueError("sample size must be positive")

    sample = list(itertools.islice(it, k)) # fill the reservoir
    random.shuffle(sample) # if there are fewer than *k* items,
                           #   return all of them in random order.
    for i, item in enumerate(it, start=k+1):
        j = random.randrange(i) # random [0..i)
        if j < k:
            sample[j] = item # replace item with gradually decreasing probability
    return sample

Example:

>>> reservoir_sample(iter('abcdefghijklmnopqrstuvwxyz'), 5)
['w', 'i', 't', 'b', 'e']

reservoir_sample() code is from this answer.
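
For instance, applied to the kind of generator the question mentions, this might look like the following (a usage sketch; output varies per run):

import itertools

# Draw 3 random permutations of 'abcde' in one pass,
# without building the full list of 120 permutations:
print(reservoir_sample(itertools.permutations('abcde'), 3))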

jfs
  • 399,953
  • 195
  • 994
  • 1,670
1

If you needed a subset of the original iterator with a fixed frequency (i.e., if the generator yields 10000 numbers you want "statistically" 100 of them, and if it yields 1000000 you want 10000: always 1%), you could simply wrap the iterator in a construct that yields each item with a probability of 1%.
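
A minimal sketch of such a wrapper (the name and the 1% rate are illustrative):

import random

def thin(iterator, rate=0.01):
    # Yield each item independently with probability `rate`;
    # the output length is only ~rate * n on average, not fixed.
    for item in iterator:
        if random.random() < rate:
            yield item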

So I guess you want instead a fixed number of samples from a source of unknown cardinality, as in the Perl algorithm you mention.

You can wrap the iterator in a construct holding a small memory of its own for the purpose of keeping track of the reservoir, replacing its contents with decreasing probability.

import itertools
import random

def reservoir(iterator, size):
    it = iter(iterator)
    # Fill the reservoir with the first `size` items.
    R = list(itertools.islice(it, size))
    n = size
    for e in it:
        n = n + 1
        # Keep the n-th item with probability size/n.
        j = random.randint(0, n-1)
        if j < size:
            R[j] = e
    return R

So

print reservoir(range(1, 1000), 3)

might print out

[656, 774, 828]

I have tried generating one million rounds as above, and comparing the distributions of the three columns with this filter (I expected a Gaussian distribution).

#                get first column and clean it
python file.py | cut -f 1 -d " " | tr -cd "0-9\n" \
    | sort | uniq -c | cut -b1-8 | tr -cd "0-9\n" | sort | uniq -c

and while not (yet) truly Gaussian, it looks good enough to me.

LSerni
  • 55,617
  • 10
  • 65
  • 107
  • Good point: A frequency-based sample is sufficient in many cases. I could even oversample by frequency, then use `random.sample()` to trim it down to the desired size. – alexis Feb 27 '16 at 13:11
  • Can you please clarify why a normal distribution is expected? I can see you're counting the frequencies, but stats is not my strong suit... Also, if you're expecting a Gaussian, why aren't you getting one (or why "not yet")? – alexis Feb 27 '16 at 13:17
  • 1
    The gaussian is the limit of a "common" random distribution "aiming" at some point. I'm expecting all numbers to be present with the same frequency on average, but this will not be achieved except by, well, random chance. What will happen is that the values will group around the expected result; if I had an infinite number of samples, the distribution of this 'un-expected-ness' would describe a Gaussian. I'm sorry if I can't explain this better. – LSerni Feb 27 '16 at 15:12
  • 1
    Your explanation is pretty clear, though you haven't said how you explain the deviation you see. But anyway I'm not sure your assumption is correct: Frequencies are discrete, not continuous, and cannot go below zero, so no true gaussian is possible. For any particular value drawn, its expected frequency has a Poisson distribution, and since the sum of Poissons is also a Poisson, I *think* that's what you're going to get when you sum over all input values. But statistics is not my strong suit so who knows... – alexis Feb 29 '16 at 11:03
0

One possible method is to build a generator around the iterator to select random elements:

import random

def random_wrap(iterator, threshold):
    # Keep each item independently with probability `threshold`.
    for item in iterator:
        if random.random() < threshold:
            yield item

This method is useful when you don't know the length and materializing the whole iterator would be prohibitively large. Note that guaranteeing the size of the final list is problematic; one possible mitigation is sketched after the sample runs below.

Some sample runs:

>>> list(random_wrap(iter('abcdefghijklmnopqrstuvwxyz'), 0.25))
['f', 'h', 'i', 'r', 'w', 'x']

>>> list(random_wrap(iter('abcdefghijklmnopqrstuvwxyz'), 0.25))
['j', 'r', 's', 'u', 'x']

>>> list(random_wrap(iter('abcdefghijklmnopqrstuvwxyz'), 0.25))
['c', 'e', 'h', 'n', 'o', 'r', 'z']

>>> list(random_wrap(iter('abcdefghijklmnopqrstuvwxyz'), 0.25))
['b', 'c', 'e', 'h', 'j', 'p', 'r', 's', 'u', 'v', 'x']
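
As noted above, the final size isn't guaranteed. One possible mitigation (a sketch of my own, echoing a suggestion in the comments): oversample with a generous threshold, then trim with random.sample, accepting a residual chance of coming up short:

import random

def random_wrap_trimmed(iterator, k, threshold):
    # Oversample by frequency, then trim the survivors down to k items.
    # May still return fewer than k if too few items pass the filter.
    kept = list(random_wrap(iterator, threshold))
    return random.sample(kept, min(k, len(kept)))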
Ethan Furman
  • 63,992
  • 20
  • 159
  • 237
  • 1
    Note that this always returns the results in sorted order. It will never return, for example, `['b', 'a']`. – Robᵩ Feb 26 '16 at 17:52
  • 1
    This might be usable for some applications, but it doesn't solve the problem of providing a uniform, *fixed-size* sample. – user2357112 Feb 26 '16 at 17:54
  • I actually like order-preserving sampling, but indeed there's no way to guarantee the size of the result. I could oversample by frequency and use random.sample() to trim that down to the desired size, but there's always a non-zero chance the original sample will be smaller than the desired size. – alexis Feb 27 '16 at 13:52
0

Use the itertools.compress() function, with a random selector function:

import itertools, random

# Keep each item with probability ~10% (randint(0, 100) < 10):
itertools.compress(long_sequence, (random.randint(0, 100) < 10 for x in itertools.repeat(1)))
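
For illustration (long_sequence here is a stand-in for the real generator; the output length varies from run to run):

import itertools
import random

long_sequence = range(1000)
thinned = list(itertools.compress(long_sequence,
                                  (random.random() < 0.1 for _ in itertools.repeat(1))))
print(len(thinned))  # roughly 100 on average, not guaranteed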
aghast
  • 14,785
  • 3
  • 24
  • 56
  • Thanks for the suggestion, but this does not generate a specific size sample. It would be an ok alternative in many cases, but it's not what the question is about. – alexis Feb 27 '16 at 13:48