
I need a function to do the following in Python.

I need to iterate through an input list and choose each element with probability p (which is small).

Here is a naive implementation to make clear what I'm doing.

import random

def f(inputList, p):
    for element in inputList:
        if random.random() < p:
            yield element

Each random number generation is expensive. We can do fewer random number generations by calculating up front how many entries to skip before the first success and then jumping straight to that entry. I have a method, but I wonder whether something like this already exists or whether there is a better way to code it. In principle I just need it for lists, but I would like something that works on a general iterable.

import scipy

def jump_calculation(p):
    # number of failures before the first success, sampled by inverting
    # the CDF of the geometric distribution
    if p == 1:
        return 0
    r = random.random()
    k = int(scipy.log(1 - r) / scipy.log(1 - p))
    return k

def binomial_choice(L, p):
    jump = jump_calculation(p)
    for element in L:
        jump -= 1
        if jump < 0:
            yield element
            jump = jump_calculation(p)

Am I reinventing the wheel? If not, are there obvious improvements that would make the code easier to read?

Joel
  • Why is the latter "better"? It doesn't look like it's doing less work, and it's certainly not easier to understand at least not for me. – unwind Jul 02 '15 at 10:06
  • Why a simple `filter` won't work? I am sorry - it won't! – gabhijit Jul 02 '15 at 10:11
  • @unwind: The "naive" version generates a random number for every single element in the list and then does a comparison. The improved version just subtracts 1 and then does a comparison - it only generates 1 random number per returned value. I'm trying to speed up code that spends most of its time generating random numbers for every element in a list. – Joel Jul 02 '15 at 10:20
  • @Joel Ah, I see, that makes sense of course. I failed to consider the generation of a random number as an expensive operation. Thanks. – unwind Jul 02 '15 at 10:21
  • @unwind I've edited the question to add this. If I were sure that it were a list, I could actually jump immediately to the correct element of a list, so the run time would go from linear in number of elements in input to linear in number of returned elements. – Joel Jul 02 '15 at 10:23
  • How about first generating a random number of elements from B(n, p), then passing it to `random.sample`? – bereal Jul 02 '15 at 10:34
  • @bereal I'd like it to work for any iterable. So I may not know what n is a priori. I'll edit the title to make this clearer. – Joel Jul 02 '15 at 10:35
  • Well, as you are using scipy anyway, you could just use `numpy.random.geometric(p)` instead. – Kolmar Jul 02 '15 at 11:06
  • If you want to handle generic iterables, why not use `next` to get each item, and just handle `StopIteration` when/if the iterable runs out? – jonrsharpe Jul 02 '15 at 11:08
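
(For context, a rough sketch of the list-only shortcut mentioned in the comments above: with random access you can index straight to each chosen element, so the cost scales with the number of returned elements rather than the length of the input. It reuses jump_calculation from the question; the name binomial_choice_list is made up for illustration and is not part of the original question.)

def binomial_choice_list(L, p):
    i = jump_calculation(p)           # index of the first selected element
    while i < len(L):
        yield L[i]
        i += jump_calculation(p) + 1  # skip ahead to the next selected index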

2 Answers


The number of Bernoulli trials until the first success follows the geometric distribution, so you can use numpy.random.geometric to generate the number of items to skip:

import itertools
import numpy

def binomial_choice(L, p):
    iterator = iter(L)
    while True:
        # geometric(p) is the 1-based position of the first success,
        # so the number of items to skip before it is one less
        to_skip = numpy.random.geometric(p) - 1
        yield next(itertools.islice(iterator, to_skip, None))

This works for any iterable, and you don't have to know the length in advance.

But for Python 3.5+ you'd have to use a more complex version because of PEP 479 (a StopIteration escaping a generator body no longer silently ends the generator, so it has to be caught explicitly):

def binomial_choice(L, p):
    iterator = iter(L)
    try:
        while True:
            to_skip = numpy.random.geometric(p) - 1
            yield next(itertools.islice(iterator, to_skip, None))
    except StopIteration:
        # the underlying iterator is exhausted; end the generator cleanly
        return

Usage examples:

In [1]: list(binomial_choice(range(100), 0.05))
Out[1]: [9, 15, 31, 53, 92, 93]

In [2]: list(binomial_choice(range(5), 1))
Out[2]: [0, 1, 2, 3, 4]

The distribution seems correct: the average number of selected elements over 100,000 runs is close to the expected n*p = 100 * 0.05 = 5:

In [5]: sum(len(list(binomial_choice(range(100), 0.05))) for i in range(100000)) / 100000
Out[5]: 4.99883

It is also faster than both of your approaches:

In [14]: timeit list(binomial_choice_geometric(range(1000), 0.01))
10000 loops, best of 3: 24.4 µs per loop

In [11]: timeit list(binomial_choice_geometric_3_5(range(1000), 0.01))
10000 loops, best of 3: 42.7 µs per loop

In [12]: timeit list(binomial_choice_jump_calculation(range(1000), 0.01))
1000 loops, best of 3: 596 µs per loop

In [13]: timeit list(binomial_choice_foreach_random(range(1000), 0.01))
1000 loops, best of 3: 203 µs per loop

It actually runs on the same scale as the random.sample approach from the other answer (modified per the suggestion in the comments to use numpy.random.binomial so it has the correct distribution), but it doesn't require having a list or knowing the len of the argument in advance:

In [19]: timeit list(binomial_choice_random_sample(range(1000), 0.01))
10000 loops, best of 3: 19.8 µs per loop
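
(The binomial_choice_random_sample used in that last timing isn't shown in the answer; presumably it is something along these lines, i.e. the other answer's snippet with the sample size drawn from the binomial distribution. The exact version benchmarked may differ.)

import numpy
import random

def binomial_choice_random_sample(L, p):
    L = list(L)
    # draw the number of selected elements from Binomial(n, p), then pick
    # that many elements uniformly without replacement
    sz = numpy.random.binomial(len(L), p)
    return random.sample(L, sz)

Note that random.sample returns the elements in selection order rather than input order, unlike the generator versions.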
Kolmar

Not sure if this is what you want, but you could use random.sample() to take a random sample from a list. It has an argument that specifies the sample size, which you can compute from the list's length: if the probability is small, the sample size is small.

from random import randint, sample

a = range(100)
probability = 0.5
max_sz = int(len(a) * probability)
sz = randint(0, max_sz)
print(sample(a, sz))
# [34, 81, 58, 52, 9, 86, 57, 29, 3, 99]

P.S. Oh, I just noticed that the idea was already introduced in comments, and that you want to be able to work with unknown iterable size. Sorry. Still, I'll leave it here.

oopcode
  • The distribution is not quite correct, what you call `max_sz` is actually a mean (see [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution)). – bereal Jul 02 '15 at 11:31
  • tl;dr: if you want to make it correct, use `sz = numpy.random.binomial(len(a), probability)` – bereal Jul 02 '15 at 11:36