55

Do you know if there is a way to get Python's random.sample() to work with a generator object? I am trying to get a random sample from a very large text corpus. The problem is that random.sample() raises the following error:

TypeError: object of type 'generator' has no len()

I was thinking that maybe there is some way of doing this with something from itertools, but I couldn't find anything with a bit of searching.

A somewhat made up example:

import random
def list_item(ls):
    for item in ls:
        yield item

random.sample(list_item(range(100)), 20)  # raises the TypeError above


UPDATE


As per Martijn Pieters's request I did some timing of the three currently proposed methods. The results are as follows.

Sampling 1000 from 10000
Using iterSample 0.0163 s
Using sample_from_iterable 0.0098 s
Using iter_sample_fast 0.0148 s

Sampling 10000 from 100000
Using iterSample 0.1786 s
Using sample_from_iterable 0.1320 s
Using iter_sample_fast 0.1576 s

Sampling 100000 from 1000000
Using iterSample 3.2740 s
Using sample_from_iterable 1.9860 s
Using iter_sample_fast 1.4586 s

Sampling 200000 from 1000000
Using iterSample 7.6115 s
Using sample_from_iterable 3.0663 s
Using iter_sample_fast 1.4101 s

Sampling 500000 from 1000000
Using iterSample 39.2595 s
Using sample_from_iterable 4.9994 s
Using iter_sample_fast 1.2178 s

Sampling 2000000 from 5000000
Using iterSample 798.8016 s
Using sample_from_iterable 28.6618 s
Using iter_sample_fast 6.6482 s

So it turns out that list.insert has a serious drawback when it comes to large sample sizes. The code I used to time the methods:

from heapq import nlargest
import random
import timeit


def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v) # add first samplesize items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")

    return results

def sample_from_iterable(iterable, samplesize):
    return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    for _ in xrange(samplesize):
        results.append(iterator.next())
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results

if __name__ == '__main__':
    pop_sizes = [int(10e+3),int(10e+4),int(10e+5),int(10e+5),int(10e+5),int(10e+5)*5]
    k_sizes = [int(10e+2),int(10e+3),int(10e+4),int(10e+4)*2,int(10e+4)*5,int(10e+5)*2]

    for pop_size, k_size in zip(pop_sizes, k_sizes):
        pop = xrange(pop_size)
        k = k_size
        t1 = timeit.Timer(stmt='iterSample(pop, %i)'%(k_size), setup='from __main__ import iterSample,pop')
        t2 = timeit.Timer(stmt='sample_from_iterable(pop, %i)'%(k_size), setup='from __main__ import sample_from_iterable,pop')
        t3 = timeit.Timer(stmt='iter_sample_fast(pop, %i)'%(k_size), setup='from __main__ import iter_sample_fast,pop')

        print 'Sampling', k, 'from', pop_size
        print 'Using iterSample', '%1.4f s'%(t1.timeit(number=100) / 100.0)
        print 'Using sample_from_iterable', '%1.4f s'%(t2.timeit(number=100) / 100.0)
        print 'Using iter_sample_fast', '%1.4f s'%(t3.timeit(number=100) / 100.0)
        print ''

I also ran a test to check that all the methods indeed do take an unbiased sample of the generator. For all methods I sampled 1,000 elements from a population of 10,000, repeated 100,000 times, and computed the average frequency of occurrence of each item in the population; it turns out to be ~0.1 for all three methods, as one would expect.
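A smaller-scale sketch of that kind of check (the check_uniformity helper and its numbers are illustrative, not the exact script used for the figures above):

from collections import Counter

def check_uniformity(sampler, pop_size=100, samplesize=10, runs=20000):
    # Tally how often each element of range(pop_size) ends up in a sample.
    counts = Counter()
    for _ in range(runs):
        counts.update(sampler(iter(range(pop_size)), samplesize))
    # Each element should be chosen in roughly samplesize/pop_size of the runs.
    freqs = [counts[i] / float(runs) for i in range(pop_size)]
    return min(freqs), max(freqs)

print(check_uniformity(iterSample))           # both values should be close to 0.1
print(check_uniformity(iter_sample_fast))
print(check_uniformity(sample_from_iterable))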

Matti Lyra
  • Have you tried `random.sample(list(gen), 20)` -- it might not be too slow! – Katriel Sep 25 '12 at 10:54
  • What exactly are you sampling from the corpus? Is there any way to represent it as something else than a generator? – Fred Foo Sep 25 '12 at 10:56
  • @larsmans words and sentences - I am trying to keep the memory consumption down with using the generator object. – Matti Lyra Sep 25 '12 at 10:57

8 Answers

28

While Martijn Pieters's answer is correct, it does slow down when samplesize becomes large, because list.insert is linear in the list length, so calling it in a loop gives quadratic complexity.

Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in xrange(samplesize):
            results.append(iterator.next())
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results

The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):

  • iterSample: 5.05s
  • iter_sample_fast: 2.64s
Dzinx
  • would using `results = list(itertools.islice(iterator, samplesize))` produce any further improvement? – Martijn Pieters Sep 26 '12 at 09:35
  • @MartijnPieters: I was just thinking about that. It would make the code simpler, but if a `ValueError` is desired, then there still has to be a check. (I'd personally handle that outside of this function if I were writing it for a library.) – Fred Foo Sep 26 '12 at 09:37
  • @larsmans: Instead of `try:`/`except StopIteration:` it'd be `if len(results) < samplesize:`. If `list(islice())` is faster than repeated `.append()` then that'd be worth it. – Martijn Pieters Sep 26 '12 at 09:48
  • This is reservoir sampling, right? https://en.wikipedia.org/wiki/Reservoir_sampling – Ale Mar 06 '17 at 20:29
  • Does anyone have a proof that this procedure is unbiased? – static_rtti Jul 31 '17 at 20:00
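For completeness, here is a sketch of the islice variant discussed in the comments above, with the length check replacing the try/except (the iter_sample_islice name is made up; this is one reading of the comments, not code from either commenter):

import random
from itertools import islice

def iter_sample_islice(iterable, samplesize):
    iterator = iter(iterable)
    # Take the first samplesize items in one go instead of an append loop.
    results = list(islice(iterator, samplesize))
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results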
23

You can't.

You have two options: read the whole generator into a list and sample from that list, or use a method that reads the generator one item at a time and builds the sample as it goes:

import random

def iterSample(iterable, samplesize):
    results = []

    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v) # add first samplesize items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")

    return results

This method adjusts the chance that the next item is part of the sample based on the number of items in the iterable so far. It doesn't need to hold more than samplesize items in memory.

The solution isn't mine; it was provided as part of another answer here on SO.
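A minimal usage sketch (illustrative, not part of the original answer; the corpus.txt path and the lines helper are hypothetical): the sample is drawn in a single pass over a file-backed generator, without ever building a list of all lines.

def lines(path):
    # Yield the file one line at a time so the whole corpus never sits in memory.
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")

print(iterSample(lines("corpus.txt"), 5))  # five random lines from the file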

Martijn Pieters
  • I was afraid that might be the case, seems like something that should be in the standard lib though. – Matti Lyra Sep 25 '12 at 10:58
  • @MattiLyra: Feel free to propose its addition to the stdlib. – Martijn Pieters Sep 25 '12 at 10:59
  • So just to check that I understand the logic of the code. It is a uniform sample from the entire generator, because the items are replaced in the result set if `samplesize` is reached before the end of the generator, allowing the later items a chance to be selected? – Matti Lyra Sep 25 '12 at 11:08
  • Do you think it would be possible to `append` instead of `insert` items? `insert` takes linear time in the average case. – Fred Foo Sep 25 '12 at 12:25
  • @larsmans: No! The insertion is instrumental in ensuring that the sample is uniform. – Martijn Pieters Sep 25 '12 at 12:27
  • @MartijnPieters: in that case, my solution would be better asymptotically (though only in *k*, not in *n*). – Fred Foo Sep 25 '12 at 12:29
  • why not initialize the size of `results` to be `samplesize` – Matti Lyra Sep 25 '12 at 12:38
  • @MattiLyra: We'd still need to randomize the results at that time, and pick those `samplesize` items from the iterator, and check if there were at least `samplesize` items. The code wouldn't be any simpler. – Martijn Pieters Sep 25 '12 at 12:40
  • @MartijnPieters no but it would avoid growing the size of the array dynamically, which can be costly if the array is large. – Matti Lyra Sep 25 '12 at 12:42
  • @MattiLyra: There is no additional cost for adding items to python lists when they are large. See [Python Time Complexity](http://wiki.python.org/moin/TimeComplexity); appending is O(1) constant cost. – Martijn Pieters Sep 25 '12 at 12:44
  • @larsmans: I'd love to see a comparison for different sizes of `k` and `n`. – Martijn Pieters Sep 25 '12 at 12:58
  • @ MartijnPieters: I'm years late with this comment, but — Appending to a list is O(1) _amortized_ constant cost, as shown on [the page you linked](https://wiki.python.org/moin/TimeComplexity). @MattiLyra is correct that reallocating a big array *does* take linear time, even if Python does guarantee that reallocation happens only (exponentially) infrequently. So a solution that reserved all `samplesize` elements up front would be `O(samplesize)` _faster_ than a solution that performed all `lg(samplesize)` reallocations. – Quuxplusone Dec 27 '18 at 15:26
8

Just for the heck of it, here's a one-liner that samples k elements without replacement from the n items generated in O(n lg k) time:

import random
from heapq import nlargest

def sample_from_iterable(it, k):
    return (x for _, x in nlargest(k, ((random.random(), x) for x in it)))
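A quick usage sketch (illustrative, not from the answer): the one-liner returns a generator of the chosen items, so it has to be consumed to obtain the sample.

squares = (n * n for n in range(1000))          # stand-in for a large generator
print(list(sample_from_iterable(squares, 5)))   # list() consumes the returned generator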
Fred Foo
  • so you give a random key to each element in `it` when you pass it to the heap? – Matti Lyra Sep 25 '12 at 12:33
  • @MattiLyra: yep. It would be even easier to pass `key=random.random()` to `nlargest`, but I'm afraid that would break the heap invariants. This does suppose that your values are comparable in the case of ties between the random keys. – Fred Foo Sep 25 '12 at 12:41
  • @MartijnPieters: it does since 2.6. If you were looking at the `heapq.py` source code, then scroll down, as `nlargest` is redefined at the end of the file. – Fred Foo Sep 25 '12 at 13:02
  • If you were to use `key` the distribution would not be properly random. For any value in the iterable where `random.random()` produced the exact same float, the *first* of the two values of the iterable would always be chosen (because `nlargest(.., key)` uses `(key(value), [decreasing counter starting at 0], value)` tuples). In your method the *larger* of the two values would be preferred in that case. So in both methods there is an (ever so) slight bias. – Martijn Pieters Sep 26 '12 at 08:46
  • @MartijnPieters: hmm, I guess you're right. However, the bias can be made arbitrarily small by letting `random.random` sample from a larger range, so I think the distribution is asymptotically uniform :) – Fred Foo Sep 26 '12 at 09:18
4

I am trying to get a random sample from a very large text corpus.

Your excellent synthesis answer currently shows victory for iter_sample_fast(gen, pop). However, I tried Katriel's recommendation of random.sample(list(gen), pop) — and it's blazingly fast by comparison!

def iter_sample_easy(iterable, samplesize):
    return random.sample(list(iterable), samplesize)

Sampling 1000 from 10000
Using iter_sample_fast 0.0192 s
Using iter_sample_easy 0.0009 s

Sampling 10000 from 100000
Using iter_sample_fast 0.1807 s
Using iter_sample_easy 0.0103 s

Sampling 100000 from 1000000
Using iter_sample_fast 1.8192 s
Using iter_sample_easy 0.2268 s

Sampling 200000 from 1000000
Using iter_sample_fast 1.7467 s
Using iter_sample_easy 0.3297 s

Sampling 500000 from 1000000
Using iter_sample_easy 0.5628 s

Sampling 2000000 from 5000000
Using iter_sample_easy 2.7147 s

Now, as your corpus gets very large, materializing the whole iterable into a list will use prohibitively large amounts of memory. But we can still exploit Python's blazing-fast-ness if we can chunk up the problem: basically, we pick a CHUNKSIZE that is "reasonably small," do random.sample on chunks of that size, and then use random.sample again to merge them together. We just have to get the boundary conditions right.

I see how to do it if the length of list(iterable) is an exact multiple of CHUNKSIZE and not bigger than samplesize*CHUNKSIZE:

import itertools
import random

def iter_sample_dist_naive(iterable, samplesize):
    CHUNKSIZE = 10000
    samples = []
    it = iter(iterable)
    try:
        while True:
            first = next(it)
            chunk = itertools.chain([first], itertools.islice(it, CHUNKSIZE-1))
            samples += iter_sample_easy(chunk, samplesize)
    except StopIteration:
        return random.sample(samples, samplesize)

However, the code above produces a non-uniform sampling when len(list(iterable)) % CHUNKSIZE != 0, and it runs out of memory as len(list(iterable)) * samplesize / CHUNKSIZE becomes "very large." Fixing these bugs is above my pay grade, I'm afraid, but a solution is described in this blog post and sounds quite reasonable to me. (Search terms: "distributed random sampling," "distributed reservoir sampling.")

Sampling 1000 from 10000
Using iter_sample_fast 0.0182 s
Using iter_sample_dist_naive 0.0017 s
Using iter_sample_easy 0.0009 s

Sampling 10000 from 100000
Using iter_sample_fast 0.1830 s
Using iter_sample_dist_naive 0.0402 s
Using iter_sample_easy 0.0103 s

Sampling 100000 from 1000000
Using iter_sample_fast 1.7965 s
Using iter_sample_dist_naive 0.6726 s
Using iter_sample_easy 0.2268 s

Sampling 200000 from 1000000
Using iter_sample_fast 1.7467 s
Using iter_sample_dist_naive 0.8209 s
Using iter_sample_easy 0.3297 s

Where we really win is when samplesize is very small relative to len(list(iterable)).

Sampling 20 from 10000
Using iterSample 0.0202 s
Using sample_from_iterable 0.0047 s
Using iter_sample_fast 0.0196 s
Using iter_sample_easy 0.0001 s
Using iter_sample_dist_naive 0.0004 s

Sampling 20 from 100000
Using iterSample 0.2004 s
Using sample_from_iterable 0.0522 s
Using iter_sample_fast 0.1903 s
Using iter_sample_easy 0.0016 s
Using iter_sample_dist_naive 0.0029 s

Sampling 20 from 1000000
Using iterSample 1.9343 s
Using sample_from_iterable 0.4907 s
Using iter_sample_fast 1.9533 s
Using iter_sample_easy 0.0211 s
Using iter_sample_dist_naive 0.0319 s

Sampling 20 from 10000000
Using iterSample 18.6686 s
Using sample_from_iterable 4.8120 s
Using iter_sample_fast 19.3525 s
Using iter_sample_easy 0.3162 s
Using iter_sample_dist_naive 0.3210 s

Sampling 20 from 100000000
Using iter_sample_easy 2.8248 s
Using iter_sample_dist_naive 3.3817 s
Quuxplusone
1

If the population size n is known, here is some memory efficient code that loops over a generator, extracting only the target samples:

from random import sample
from itertools import count, compress

# n is the known population size; pop is the population generator (defined elsewhere).
targets = set(sample(range(n), k=10))   # pick 10 target indices up front
for selection in compress(pop, map(targets.__contains__, count())):
    print(selection)

This outputs the selections in the order they are produced by the population generator.

The technique is to use the standard library random.sample() to randomly select the target indices for the selections. The second line determines whether a given index is among the targets and, if so, yields the corresponding value from the generator.

For example, given targets of {6, 2, 4}:

0  1  2  3  4  5  6  7  8  9  10   ...  output of count()
F  F  T  F  T  F  T  F  F  F  F    ...  is the count in targets?
A  B  C  D  E  F  G  H  I  J  K    ...  output of the population generator
-  -  C  -  E  -  G  -  -  -  -    ...  selections emitted by compress

This technique is suitable for looping over a corpus too large to fit in memory (otherwise, you could just use sample() directly on the population).
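A self-contained sketch of the same idea with made-up numbers (Python 3, where map is lazy; sample_known_length is just a convenience wrapper for illustration):

from random import sample
from itertools import count, compress

def sample_known_length(population_iter, n, k):
    # Choose k target indices up front, then stream the population once,
    # keeping only the items whose running index is one of the targets.
    targets = set(sample(range(n), k))
    return list(compress(population_iter, map(targets.__contains__, count())))

gen = (i * i for i in range(1000000))   # stand-in for a large corpus generator
print(sample_known_length(gen, 1000000, 5))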

Raymond Hettinger
0

If the number of items in the iterator is known (by counting the items elsewhere), another approach is:

import random

def iter_sample(iterable, iterlen, samplesize):
    if iterlen < samplesize:
        raise ValueError("Sample larger than population.")
    # Pick samplesize distinct indices; randint is inclusive at both ends,
    # so the upper bound must be iterlen - 1 to stay within the iterable.
    indexes = set()
    while len(indexes) < samplesize:
        indexes.add(random.randint(0, iterlen - 1))
    indexesiter = iter(sorted(indexes))
    current = indexesiter.next()
    ret = []
    for i, item in enumerate(iterable):
        if i == current:
            ret.append(item)
            try:
                current = indexesiter.next()
            except StopIteration:
                break
    random.shuffle(ret)  # the indices were visited in sorted order, so reshuffle
    return ret

I find this quicker, especially when samplesize is small relative to iterlen. When the whole, or nearly the whole, population is asked for, however, there are issues.

iter_sample      (iterlen=10000,   samplesize=100):         1 ms
iter_sample_fast (iterlen=10000,   samplesize=100):        15 ms

iter_sample      (iterlen=1000000, samplesize=100):        65 ms
iter_sample_fast (iterlen=1000000, samplesize=100):      1477 ms

iter_sample      (iterlen=1000000, samplesize=1000):       64 ms
iter_sample_fast (iterlen=1000000, samplesize=1000):     1459 ms

iter_sample      (iterlen=1000000, samplesize=10000):      86 ms
iter_sample_fast (iterlen=1000000, samplesize=10000):    1480 ms

iter_sample      (iterlen=1000000, samplesize=100000):    388 ms
iter_sample_fast (iterlen=1000000, samplesize=100000):   1521 ms

iter_sample      (iterlen=1000000, samplesize=1000000): 25359 ms
iter_sample_fast (iterlen=1000000, samplesize=1000000):  2178 ms

0

The fastest method until proven otherwise, when you have an idea of how long the generator is (and the sample will be asymptotically uniformly distributed):

import numpy

def gen_sample(generator_list, sample_size, iterlen):
    # Pre-draw a boolean mask: each position is kept with probability
    # sample_size / iterlen.
    num = 0
    inds = numpy.random.random(iterlen) <= (sample_size * 1.0 / iterlen)
    results = []
    iterator = iter(generator_list)
    gotten = 0
    while gotten < sample_size:
        try:
            b = iterator.next()
            if inds[num]:
                results.append(b)
                gotten += 1
            num += 1
        except StopIteration:
            # Ran out of items before collecting sample_size of them: start
            # another pass with a fresh mask (this only works if the input can
            # be iterated again, e.g. an xrange, not a one-shot generator).
            num = 0
            iterator = iter(generator_list)
            inds = numpy.random.random(iterlen) <= ((sample_size - gotten) * 1.0 / iterlen)
    return results

It is the fastest on both the small iterable and the huge iterable (and probably everything in between).

# Huge
res = gen_sample(xrange(5000000), 200000, 5000000)
timing: 1.22 s

# Small
z = gen_sample(xrange(10000), 1000, 10000)
timing: 0.000441 s
PascalVKooten
0

Here's a radically different variation that uses a set as a bucket of items. It starts by priming the bucket with pool items, then yields samples from the bucket, replacing them with new items from the iterator; finally, it drains what is left of the bucket.

HashWrapper serves to hide unhashable types from the set.

import random
from typing import Iterator


class HashWrapper(tuple):
    """Wrap unhashable type."""
    def __hash__(self):
        return id(self)


def randomize_iterator(data: Iterator, pool=100) -> Iterator:
    """
    Randomize an iterator.
    """

    bucket = set()
    iterator = iter(data)

    # Prime the bucket
    for _ in range(pool):
        try:
            bucket.add(HashWrapper(next(iterator)))
        except StopIteration:
            # We've drained the iterator
            break

    # Start picking from the bucket and replacing new items from the iterator
    for item in iterator:
        sample, = random.sample(tuple(bucket), 1)  # pass a sequence; newer Pythons reject sets here
        yield sample
        bucket.remove(sample)
        bucket.add(HashWrapper(item))

    # Drain the bucket
    yield from random.sample(tuple(bucket), len(bucket))
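A brief usage sketch (illustrative, not from the answer), feeding made-up (index, value) pairs through a small pool; every yielded item comes back as a HashWrapper tuple, and the output order is randomized only within the limits of the pool size.

pairs = ((i, i * i) for i in range(20))
for item in randomize_iterator(pairs, pool=5):
    print(item)   # e.g. (3, 9); items emerge in a pool-limited shuffled order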
Danielle Madeley