High performance weighted random choice for python 2?

Question

I have the following Python function, which picks a random element from the sequence "seq", weighted by a second sequence that holds the weight of each element in seq:

import random

def weighted_choice(seq, weights):
    assert len(seq) == len(weights)

    total = sum(weights)
    r = random.uniform(0, total)
    upto = 0
    for i in range(len(seq)):
        if upto + weights[i] >= r:
            return seq[i]
        upto += weights[i]
    assert False, "Shouldn't get here"

If I call the above a million times with a 1000 element sequence, like this:

import time

seq = range(1000)
weights = []
for i in range(1000):
    weights.append(random.randint(1,100))

st=time.time()
for i in range(1000000):
    r=weighted_choice(seq, weights)
print (time.time()-st)

It runs for approximately 45 seconds in CPython 2.7 and for 70 seconds in CPython 3.6. It finishes in around 2.3 seconds in PyPy 5.10, which would be fine for me, but sadly I can't use PyPy for various reasons.

Any ideas on how to speed up this function on cpython? I'm interested in other implementations (algorithmically, or via external libraries, like numpy) as well if they perform better.

PS: Python 3 has random.choices with weights. It runs for around 23 seconds, which is better than the function above but still about ten times slower than PyPy.

I've tried it with numpy this way:

weights=[1./1000]*1000
st=time.time()
for i in range(1000000):
    #r=weighted_choice(seq, weights)
    #r=random.choices(seq, weights)
    r=numpy.random.choice(seq, p=weights)
print (time.time()-st)

It ran for 70 seconds.

Possible duplicate of [A weighted version of random.choice](https://stackoverflow.com/questions/3679694/a-weighted-version-of-random-choice) — user2699, Mar 08 '18 at 14:49

FHTMitchell · Answer 1 · 2018-03-08T15:53:53.417

2

You can use numpy.random.choice (the p parameter is the weights). Normally numpy functions are vectorized and so run at near-C speed.

Implement as:

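import numpy as np

def weighted_choice(seq, weights):
    w = np.asarray(weights, dtype=float)  # float avoids integer division on Python 2
    p = w / w.sum()  # normalize; can skip if weights already sum to 1
    return np.random.choice(seq, p=p)  # pass the normalized p, not the raw weights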
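
When many draws are needed, as in the question's million-iteration loop, calling the function once per draw pays the NumPy call overhead every time. `np.random.choice` also accepts a `size` argument, so all draws can be requested in a single call. A minimal sketch, assuming the weights stay fixed between draws; the name `weighted_choices` and the parameter `n` are illustrative and not from the original post:

```python
import numpy as np

def weighted_choices(seq, weights, n):
    """Draw n weighted samples (with replacement) in one vectorized call."""
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()  # normalize so the probabilities sum to 1
    return np.random.choice(seq, size=n, p=p)

seq = np.arange(1000)
weights = np.random.randint(1, 101, size=1000)  # integer weights in 1..100
samples = weighted_choices(seq, weights, 1000000)
```

This only helps if the draws can be batched; if each draw depends on the result of the previous one, the per-call overhead remains.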

Edit:

Timings:

%timeit np.random.choice(x, p=w)  # len(x) == 1_000_000
13 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.random.choice(y, p=w)  # len(y) == 100_000_000
1.28 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

edited Mar 08 '18 at 15:53

answered Mar 08 '18 at 14:30

FHTMitchell


I've tried it and it's also too slow compared to pypy. – user582175 Mar 08 '18 at 15:41

I'm shocked that numpy would be so much slower than pypy. Numpy is *normally near* C speed. Do you mind showing your implementation and timings with numpy in the OP? – FHTMitchell Mar 08 '18 at 15:49

I think your timing method is faulty. PyPy works by caching intermediate results. If it sees you're doing the same operation over again with the same unchanging arguments, it will cache that and just return it to you over and over again. I don't believe pypy is actually running your function a million times in 2.5 seconds. Use the `timeit` module instead. – FHTMitchell Mar 08 '18 at 16:11
I've included numpy code. Also, changed the pypy loop to pass the iterator variable and I can confirm it runs the function 1M times. – user582175 Mar 08 '18 at 17:14
Right. That is not how you should do timings in python. It is not very realistic. Use the `timeit` module on just one call to your `weighted_choice` function in CPython and PyPy, and then do the same with numpy. I think you'll find the results very different. If the results stay as they are I will be very impressed with PyPy. – FHTMitchell Mar 08 '18 at 17:18
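
The `timeit` suggestion in the comments above could look roughly like the following. This is a sketch only: the function follows the same linear-scan logic as the question's `weighted_choice`, while the call count `calls` and the final fallback `return` are assumptions added for illustration.

```python
import random
import timeit

def weighted_choice(seq, weights):
    """Linear scan over the weights, same logic as in the question."""
    total = sum(weights)
    r = random.uniform(0, total)
    upto = 0
    for item, weight in zip(seq, weights):
        if upto + weight >= r:
            return item
        upto += weight
    return seq[-1]  # guard against floating-point round-off

seq = list(range(1000))
weights = [random.randint(1, 100) for _ in seq]

# Time a modest number of calls and report the average cost per call.
calls = 2000
elapsed = timeit.timeit(lambda: weighted_choice(seq, weights), number=calls)
per_call_us = elapsed / calls * 1e6
print("%.1f us per call" % per_call_us)
```

Running the same script under each interpreter gives per-call figures that can be compared directly.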

score 0 · Accepted Answer · answered Mar 08 '18 at 17:10

You could take this approach with numpy. If you eliminate the for loop, you get the real benefit of numpy by indexing the positions you need:

#Untimed since you did not
seq = np.arange(1000)
weights = np.random.randint(1,100,(1000,1))


def weights_numpy(seq,weights,iterations):
    """
    :param seq: Input sequence
    :param weights: Input Weights
    :param iterations: Iterations to run
    :return: 
    """
    r = np.random.uniform(0,weights.sum(0),(1,iterations)) #create array of choices
    ar = weights.cumsum(0) # get cumulative sum
    return seq[(ar >= r).argmax(0)] #get indices of seq that meet your condition

And the timing (python 3,numpy '1.14.0')

%timeit weights_numpy(seq,weights,1000000)
4.05 s ± 256 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Which is a bit slower than PyPy, but hardly...
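
The comparison `ar >= r` above broadcasts a (1000, 1) array against a (1, iterations) array, so for a million draws it builds a 1000 x 1,000,000 boolean matrix (about 1 GB). One possible variation, not part of the original answer, is to binary-search the cumulative weights with `np.searchsorted`, which looks up the same "first index where the cumulative sum reaches r" without materializing that matrix. A sketch, with the function name `weights_searchsorted` chosen here for illustration:

```python
import numpy as np

def weights_searchsorted(seq, weights, iterations):
    """Draw `iterations` weighted samples via binary search on cumulative weights."""
    cum = np.cumsum(np.ravel(weights))             # 1-D running total of the weights
    r = np.random.uniform(0, cum[-1], iterations)  # one random threshold per draw
    # side='left' gives the first index i with cum[i] >= r,
    # the same condition as (ar >= r).argmax(0) above
    idx = np.searchsorted(cum, r, side='left')
    return np.asarray(seq)[idx]

seq = np.arange(1000)
weights = np.random.randint(1, 100, (1000, 1))
result = weights_searchsorted(seq, weights, 1000000)
```

No timing is given for this variant; it would need to be measured with `%timeit` on the same machine to compare it with the figure above.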
