3

As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know the size of the sample in advance? You end up sampling the entire population, or shuffling it, which is the same thing. But this can be wasteful (if most sample sizes turn out to be small compared to the population size) or even infeasible (if the population size is huge and you run out of memory). Also, what if your code needs to jump from here to there before picking the next element of the sample?
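
For reference, the wasteful baseline described above looks roughly like this (a minimal sketch, Python 2.7 like the rest of the thread: shuffle a full copy of the population up front, then consume it lazily):

import random

population = range(1000000)        # the whole population sits in memory
shuffled = list(population)        # full copy...
random.shuffle(shuffled)           # ...and full O(N) shuffle, paid up front
sample = []
for element in shuffled:           # even if only a few picks are needed
    sample.append(element)
    if len(sample) == 3:           # stop condition known only at runtime
        break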

P.S. I bumped into the need to optimize random sampling while working on simulated annealing for TSP. In my code sampling is restarted hundreds of thousands of times, and each time I don't know whether I will need to pick 1 element or 100% of the elements of the population.

mmj
  • I don't understand your "question" as written. What's the difference between what you're proposing and popping a `random.choice` (or a `random.randint` index) in a loop? – Two-Bit Alchemist May 05 '15 at 21:40
  • If you'd like a review of your code, and it works, see http://codereview.stackexchange.com. If you just want to share what you've written, open an account on GitHub or similar. – jonrsharpe May 05 '15 at 21:43
  • Edited to make it more appropriate for Stack Overflow... it is fine to answer your own questions and even post a question just to share your solution... fair warning: people are usually extra critical of answers that do this – Joran Beasley May 05 '15 at 21:44
  • @Two-Bit Alchemist How do you avoid duplicates in the sample with `random.choice`? – mmj May 05 '15 at 21:54
  • @jonrsharpe I searched for a solution to my problem on Stack Overflow, and since I haven't found any question about this specific problem, I thought that starting a question and posting my solution could be useful to others like me who search on this site. If it's not appropriate for the site, tell me and I'll delete the question. – mmj May 05 '15 at 22:07
  • StackOverflow isn't the entire universe. If you think a `random.itersample` is useful, the usual thing to do is put it on PyPI and/or the ActiveState recipes, and if you get a lot of traction (or think it's so obviously useful you don't need to wait for that) propose it for inclusion in the stdlib. – abarnert May 05 '15 at 22:07
  • @abarnert Thanks, I did not know, I'll try it. – mmj May 05 '15 at 22:12
  • @mmj You missed the part where I said _popping_ it in a loop. If you're removing the `random.choice` from your sample each time, there is no repetition (sampling without replacement just like `random.sample` just you're doing the work yourself). – Two-Bit Alchemist May 06 '15 at 15:47
  • @Two-Bit Alchemist Yes, I missed the `popping`. That solution is nice and easy; the only drawback is that you have to make a copy of the population, which is time- and memory-consuming if the population is huge. So I would say that such a solution is great, but not *comfortable* with huge population sizes. – mmj May 06 '15 at 16:03
  • Possible duplicate of [Python random sample with a generator iterable iterator](http://stackoverflow.com/questions/12581437/python-random-sample-with-a-generator-iterable-iterator) – Matti Lyra Oct 25 '16 at 09:57

5 Answers

1

First, I would split the population into blocks. The function that does the block sampling can easily be written as a generator, able to process a sample of arbitrary size.

Imagine an infinite population, a population block of 512 and a sample size of 8. This means you could gather as many samples as you need, and for a future reduction you can sample the already sampled space again (for 1024 blocks this means 8192 samples from which you can sample again).

At the same time, this allows for parallel processing, which may be feasible in the case of very large samples.

Example assuming an in-memory population:

import random

population = [random.randint(0, 1000) for i in range(0, 150000)]

def sample_block(population, block_size, sample_size):
    block_number = 0
    while 1:
        try:
            # random.sample raises ValueError once the remaining slice is
            # smaller than sample_size, which terminates the generator
            yield random.sample(population[block_number * block_size:(block_number + 1) * block_size], sample_size)
            block_number += 1
        except ValueError:
            break

sampler = sample_block(population, 512, 8)
samples = []

try:
    while 1:
        samples.extend(sampler.next())
except StopIteration:
    pass

print random.sample(samples, 200)

If the population were external to the script (a file, block storage), the only modification is that you would have to load the appropriate chunk into memory. Proof of concept of how sampling an infinite population could look:

import random
import time

def population():
    while 1:
        yield random.randint(0, 10000)

def reduced_population(samples):
    for sample in samples:
        yield sample

def sample_block(generator, block_size, sample_size):

    block_number = 0
    block = []
    while 1:
        block.append(generator.next())
        if len(block) == block_size:
            s = random.sample(block, sample_size)
            block_number += 1
            block = []
            print 'Sampled block {} with result {}.'.format(block_number, s)
            yield s

samples = []
result = []
reducer = sample_block(population(), 512, 12)

try:
    while 1:
        samples.append(reducer.next())
        if len(samples) == 1000:
            sampler = sample_block(reduced_population(samples), 1000, 15)
            result.append(list(sampler))
            time.sleep(5)
except StopIteration:
    pass

Ideally, you'd also gather the samples and again sample them.
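
For instance, once the collected per-block samples are flattened, a final fixed-size pick can be drawn from them (a minimal sketch; `final_size` is a hypothetical parameter and `samples` is the list gathered in the example above):

import random

# flatten the per-block samples gathered above, then sample them again
flat = [item for block_sample in samples for item in block_sample]
final_size = 20  # hypothetical final sample size
final_sample = random.sample(flat, final_size)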

mpolednik
  • Example idea added, it still uses the population in memory but can be extended by always loading a chunk of the population. – mpolednik May 05 '15 at 22:59
  • I tried your code using `population = list(range(100))` and `sampler = sample_block(population, 10, 4)`, but was not able to get a sample bigger than 40 elements. – mmj May 05 '15 at 23:43
  • range(100) gives exactly 100 elements, and using a block size of 10 yields 10 iterations of 4 samples => 40 elements is correct. The formula for number of elements is len(population) / block_size * sample_size. edit: considering that, of course, round numbers and that sample_size <= block_size and that block_size <= len(population). Additional care would need to be taken in order to correctly sample len(remaining_population) where remaining_population < block_size. – mpolednik May 05 '15 at 23:48
  • I guessed that, but how can I get a sample of, say, 90% of the population, given that when I start picking elements I don't know the final size of the sample? – mmj May 05 '15 at 23:53
  • Well, this code doesn't care about sample size at all and will keep going as long as data keeps coming. If your population is 10 GiB of block storage and you edited the code to read it correctly (that means using an offset for the samples or similar) it would just sample everything after a long while and create a reduced population, which can be further sampled until you get the desired final sample size. – mpolednik May 05 '15 at 23:57
  • If I understood your code, you must have at least approximate knowledge of how many elements you will pick, because otherwise you must take care of duplicates, am I right? – mmj May 06 '15 at 00:09
  • More than sampling, it is reducing the population. It doesn't handle duplicates because the data is expected to be sampled again, producing the final sample without the duplicates. – mpolednik May 06 '15 at 00:12
  • By "to be sampled again", you mean a brand new sampling? My question is still: how can you get a sample of 90% of the population? – mmj May 06 '15 at 01:11
1

That's what generators are for, I believe. Here is an example of Fisher-Yates-Knuth sampling via a generator/yield: you get events one by one and stop when you want to.

Code updated

import random
import numpy
import array

class populationFYK(object):
    """
    Implementation of the Fisher-Yates-Knuth shuffle
    """
    def __init__(self, population):
        self._population = population      # reference to the population
        self._length     = len(population) # length of the sequence
        self._index      = len(population)-1 # last unsampled index
        self._popidx     = array.array('i', range(0,self._length))

        # array module vs numpy
        #self._popidx     = numpy.empty(self._length, dtype=numpy.int32)
        #for k in range(0,self._length):
        #    self._popidx[k] = k


    def swap(self, idx_a, idx_b):
        """
        Swap two elements in population
        """
        temp = self._popidx[idx_a]
        self._popidx[idx_a] = self._popidx[idx_b]
        self._popidx[idx_b] = temp

    def sample(self):
        """
        Yield one sampled case from population
        """
        while self._index >= 0:
            idx = random.randint(0, self._index) # index of the sampled event

            if idx != self._index:
                self.swap(idx, self._index)

            sampled = self._population[self._popidx[self._index]] # yielding it

            self._index -= 1 # one less to be sampled

            yield sampled

    def index(self):
        return self._index

    def restart(self):
        self._index = self._length - 1
        for k in range(0,self._length):
            self._popidx[k] = k

if __name__=="__main__":
    population = [1,3,6,8,9,3,2]

    gen = populationFYK(population)

    for k in gen.sample():
        print(k)
Severin Pappadeux
  • Nice method, but if the initial copy of the population can't be avoided, it would be time- and memory-consuming for huge populations; not a decisive improvement over just yielding a shuffled copy of the entire population (it just avoids the shuffling). – mmj May 06 '15 at 08:29
  • @mmj that's correct. If the population collection can be destroyed, that's one option where we could avoid making the copy. If not, and the population records are large, the best way is to sample over a compact array of indices. So sampling is done over indices, and the population is left intact. See the updated code – Severin Pappadeux May 06 '15 at 14:52
  • @mmj and if you don't like numpy, there is the built-in array module, see another code update – Severin Pappadeux May 06 '15 at 15:03
  • An array of indexes is better, but in the case of huge populations, or if you have to restart the sampling often and most sample sizes are small compared to the population size, this solution, although nice, is not optimal. – mmj May 06 '15 at 16:22
  • @mmj Restart is quite trivial, added the code. You don't have to pay for a new allocation. – Severin Pappadeux May 06 '15 at 17:30
  • I compared our codes; yours is faster when the sample size ends up being bigger than about 1% of the population size, otherwise mine is faster. Anyway, when it comes to huge population sizes (about 100 million on my system), your code runs out of memory. – mmj May 07 '15 at 10:18
  • @mmj This is VERY strange. FYK shuffle is known to be linear (and my implementation is), and yours is something like O(n ln(n)), so I expected it to be the other way around. Running out of memory - well, that's what might be expected, though indices are compact, adding 400Mb for 100M records. Running 32-bit Python, I would venture? – Severin Pappadeux May 08 '15 at 00:09
  • I guess FYK is linear *after* the copy of the population. My solution, on the contrary, does not need a copy of the population at all, and that's why it can also deal with populations of any size. – mmj May 08 '15 at 01:26
  • @mmj There is no population copy in my current code, just the (const) reference. Your solution is more memory efficient, true, because you keep selected indices int[`ns`] and I keep all indices int[`N`], but I find it hard to believe yours is faster as the selection % goes higher. You have to do a binary search to check whether you have a collided sample, and I don't. – Severin Pappadeux May 08 '15 at 15:15
  • By "copy of population" I meant building the indexes. Even if indexes are lighter than actual population records, when it comes to sizes of millions, building the indexes has its cost. I think you misunderstood my previous comment; my code is faster when the sample size goes lower (not higher) compared to the population size. – mmj May 08 '15 at 15:30
  • @mmj Yes, indices do have their cost, no doubt. `my code is faster when the sample size goes lower (not higher) compared to the population size` - that does make sense. BTW, I have another idea, let me write it in another answer – Severin Pappadeux May 08 '15 at 16:08
  • @mmj Done wrt Bloom filter – Severin Pappadeux May 08 '15 at 16:27
0

I wrote (in Python 2.7.9) a random sampler generator (of indexes) whose speed depends only on the sample size (it should be O(ns log(ns)), where ns is the sample size). So it is fast when the sample size is small compared to the population size, because it does NOT depend on population size at all. It doesn't build any population collection; it just picks random indexes and uses a kind of bisect method on the sampled indexes to avoid duplicates and keep them sorted. Given an iterable population, here's how to use the itersample generator:

import random
sampler=itersample(len(population))
next_pick=sampler.next() # pick the next random (index of) element

or

import random
sampler=itersample(len(population))
sample=[]
for index in sampler:
    # do something with (index of) picked element
    sample.append(index) # build a sample
    if some_condition: # stop sampling when needed
        break

If you need the actual elements and not just the indexes, just apply the population iterable to the index when needed (population[sampler.next()] and population[index] for the first and second example respectively).

The results of some tests show that speed does NOT depend on population size, so if you need to randomly pick only 10 elements from a population of 100 billion, you pay only for 10 (remember, we don't know in advance how many elements we'll pick, otherwise you'd better use random.sample).

Sampling 1000 from 1000000
Using itersample 0.0324 s

Sampling 1000 from 10000000
Using itersample 0.0304 s

Sampling 1000 from 100000000
Using itersample 0.0311 s

Sampling 1000 from 1000000000
Using itersample 0.0329 s

Other tests confirm that running time is slightly more than linear with sample size:

Sampling 100 from 1000000000
Using itersample 0.0018 s

Sampling 1000 from 1000000000
Using itersample 0.0294 s

Sampling 10000 from 1000000000
Using itersample 0.4438 s

Sampling 100000 from 1000000000
Using itersample 8.8739 s

Finally, here is the generator function itersample:

import random
def itersample(c): # c: population size
    sampled=[] # indexes already yielded, kept sorted
    def fsb(a,b): # free spaces before middle of interval a,b
        fsb.idx=a+(b+1-a)/2 # integer division (Python 2)
        fsb.last=sampled[fsb.idx]-fsb.idx if len(sampled)>0 else 0
        return fsb.last
    while len(sampled)<c:
        sample_index=random.randrange(c-len(sampled)) # rank among indexes not yet sampled
        a,b=0,len(sampled)-1
        if fsb(a,a)>sample_index: # pick falls before the first sampled index
            yielding=sample_index
            sampled.insert(0,yielding)
            yield yielding
        elif fsb(b,b)<sample_index+1: # pick falls after the last sampled index
            yielding=len(sampled)+sample_index
            sampled.insert(len(sampled),yielding)
            yield yielding
        else: # sample_index falls inside sampled list
            while a+1<b: # bisect to find how many sampled indexes precede the pick
                if fsb(a,b)<sample_index+1:
                    a=fsb.idx
                else:
                    b=fsb.idx
            yielding=a+1+sample_index # a+1 sampled indexes precede the pick
            sampled.insert(a+1,yielding)
            yield yielding
mmj
  • This code is really confusing. Why are you using attributes on the functions all over the place? While that can *occasionally* be a good way to save state in a place where it can be easily introspected, in this case the `sample_gen.sampled` list is just a glorified global, and the rest makes the code very hard to follow. Also, as far as I can tell, sampling using this generator is going to cost `O(ns**2)` time (where `ns` is the number of samples), not `O(ns*log(ns))` as you claim, because each `list.insert` call you make on `sample_gen.sampled` will take `O(ns)` on average. – Blckknght May 06 '15 at 02:31
  • Sorry if the code is confusing, but the algorithm is really not trivial. The `sample_gen.sampled` attribute is needed to reset the generator from outside (we don't know in advance the size of the sample, so there must be a way to *manually* reset the generator); if you suggest a better way I'll be glad to implement it. Speed tests confirm my hypothesis that time is slightly more than linear with sample size (see the updated answer). – mmj May 06 '15 at 08:27
  • Um, usually a generator doesn't manipulate global state. Its state is its local variables. To restart the generator, you call it again. – Blckknght May 06 '15 at 08:44
  • In the original code that I wrote this generator for, a global state variable is necessary, because when I enter the function that calls the sampling, I may need to continue the previous sampling. That's why I originally included the global state in the generator, but I guess you are right, it's better to separate the global state from the generator; I'll update the answer. My only doubt is: if I have to start a brand new sampling millions of times, and most of the time the generators don't reach their end, will all such *suspended* generators consume memory? I hope the GC will take care of them. – mmj May 06 '15 at 09:54
  • Generator name was changed from `sample_gen` to `itersample`. – mmj May 06 '15 at 10:49
  • `random.sample` has a special implementation for exactly this case using xrange. You can replace your entire code simply by `random.sample(xrange(population_size), sample_size)` which [is especially fast and space efficient for sampling from a large population](https://docs.python.org/2/library/random.html) – Thomas B. May 06 '15 at 11:41
  • @Thomas B. The question addresses the case when you don't know the sample size in advance; `sample_size` was used in the examples for illustration purposes only, but I admit that it was really misleading, so I rewrote the examples from scratch, thanks for pointing it out. – mmj May 07 '15 at 09:30
0

You can get a sample of size K out of a population of size N by picking K non-repeating random numbers in the range [0...N[ and treating them as indexes.

Option a)

You could generate such an index sample using the well-known sample method.

random.sample(xrange(N), K)

From the Python docs about random.sample:

To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population

Option b)

If you don't like the fact that random.sample already returns a list instead of a lazy generator of non-repeating random numbers, you can go fancy with Format-Preserving Encryption to encrypt a counter.

This way you get a real generator of random indexes, and you can pick as many as you want and stop at any time, without getting any duplicates, which gives you dynamically sized sample sets.

The idea is to construct an encryption scheme to encrypt the numbers from 0 to N. Now, every time you want to get a sample from your population, you pick a random key for your encryption and start to encrypt the numbers from 0, 1, 2, ... onwards (this is the counter). Since every good encryption creates a random-looking 1:1 mapping, you end up with non-repeating random integers you can use as indexes. The storage requirement during this lazy generation is just the initial key plus the current value of the counter.

The idea was already discussed in Generating non-repeating random numbers in Python. There is even a Python snippet linked: formatpreservingencryption.py

Sample code using this snippet could be implemented like this:

import random
import string

# FPEInteger is provided by the linked formatpreservingencryption.py snippet

def itersample(population):
    # Get the size of the population
    N = len(population)
    # Get the number of bits needed to represent this number
    bits = (N-1).bit_length()
    # Generate some random key
    key = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(32))
    # Create a new crypto instance that encrypts binary blocks of width <bits>
    # Thus, being able to encrypt all numbers up to the nearest power of two
    crypter = FPEInteger(key=key, radix=2, width=bits)

    # Count up 
    for i in xrange(1<<bits):
        # Encrypt the current counter value
        x = crypter.encrypt(i)
        # If it is bigger than our population size, just skip it
        # Since we generate numbers up to the nearest power of 2, 
        # we have to skip up to half of them, and on average up to one at a time
        if x < N:
            # Return the randomly chosen element
            yield population[x]
Thomas B.
  • For *option a)* you need to know sample size in advance, which is out of the scope of this question. *Option b)* seems to behave similarly to the code in my answer (except for the method it uses to obtain the result), and I'd like to compare their performance, but I'm not sure how to use the code you linked; can you make an example of how to pick the next random number in the interval (0,N)? – mmj May 06 '15 at 13:35
  • I've added some sample code for option b). It is most probably slightly slower because of the usage of AES, but you could use some faster cryptographic scheme, maybe DES. If you use the random module, you probably don't care about cryptographically safe randomness. The nice part is that it only uses O(1) additional storage, entirely independent of the size of your sample set. – Thomas B. May 06 '15 at 14:47
  • @ThomasB. The problem is, a statement like "a seemingly random value" has to be VERIFIED. Behind Python or C++ RNGs like the Mersenne Twister there are efforts to build RNG test suites, check multi-D distributions, etc. The encryption mapping from an index to some other index is 1:1, that's true. This mapping is reversible. But is it random? I have yet to get a clear idea whether those sequences have passed any serious RNG test suite (DieHard and the like) – Severin Pappadeux May 06 '15 at 15:34
  • I was not able to try your code because I couldn't install `pycrypto`, due to obscure installation errors. Is there any chance that you can eliminate the dependencies from your code? – mmj May 06 '15 at 17:05
  • Any statistically significant deviation from "randomness" would represent a weakness in the underlying cipher. The stronger the cipher -> the better the randomness. – Thomas B. May 06 '15 at 17:06
  • The usage of block ciphers in counter mode (CTR) to generate cryptographically secure pseudo-random numbers is a well-known strategy. Section 10.2.1 of [NIST SP 800-90A](http://csrc.nist.gov/publications/nistpubs/800-90A/SP800-90A.pdf) gives additional tips. The German Wikipedia additionally states that, according to [this paper](http://dl.acm.org/citation.cfm?doid=1268776.1268777), generators based on well-known ciphers such as AES pass all common statistical tests. – Thomas B. May 06 '15 at 17:06
  • You need pycrypto for AES. – Thomas B. May 06 '15 at 17:11
  • Another [paper](http://random.mat.sbg.ac.at/publics/ftp/pub/publications/peter/aes_sub.ps) states that "AES performed very well [...] also a very interesting nonlinear random number generator for stochastic simulation. We were not able to find any statistical weaknesses". The pLab [suggests](http://random.mat.sbg.ac.at/software/): "simply use AES in counter mode" – Thomas B. May 06 '15 at 17:24
0

Here is another idea. For a huge population we would like to keep some info about the selected records. In your case you keep one integer index per selected record - a 32-bit or 64-bit integer - plus some code to do a reasonable search on the selected/not-selected set. In the case of a large number of selected records this record keeping might be prohibitive. What I would propose is to use a Bloom filter for the set of selected indices. False positive matches are possible, but false negatives are not, thus there is no risk of getting duplicated records. It does introduce a slight bias - false positive records will be excluded from the sampling. But the memory efficiency is good: fewer than 10 bits per element are required for a 1% false positive probability. So if you select 5% of the population and have 1% false positives, you miss 0.0005 (0.05%) of your population, which depending on the requirements might be OK. If you want fewer false positives, use more bits. But the memory efficiency would be a lot better, though I expect there is more code to execute per record sampled.

Sorry, no code
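
To illustrate the idea anyway, here is a minimal sketch (Python 2.7, like the other snippets in this thread), not a tuned implementation: a hand-rolled Bloom filter over a bit array with hashlib-based hash functions, combined with a simple retry-on-collision loop. All names and parameters here are just illustrative; the caller stops iterating once it has enough picks.

import hashlib
import random

class BloomFilter(object):
    """Minimal Bloom filter: k hash functions over a bit array of m bits."""
    def __init__(self, m, k):
        self.m = m                          # number of bits
        self.k = k                          # number of hash functions
        self.bits = bytearray(m // 8 + 1)   # bit array, packed into bytes

    def _positions(self, value):
        # derive k bit positions from k salted hashes of the value
        for i in range(self.k):
            digest = hashlib.sha1('%d:%d' % (i, value)).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

def itersample_bloom(population_size, m=8 * 1024 * 1024, k=7):
    # Yield random indexes without duplicates; a false positive in the
    # filter only means an index is skipped, never returned twice.
    seen = BloomFilter(m, k)
    while True:
        index = random.randrange(population_size)
        if index not in seen:
            seen.add(index)
            yield index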

Severin Pappadeux
  • Interesting idea; if someone wished to develop some code for it, I'd be glad to test it. – mmj May 11 '15 at 09:43