generate a 2D array of numpy.random.choice without replacement

Question

I'm tyring to make my code faster by removing some for loops and using arrays. The slowest step right now is the generation of the random lists.

context: I have a number of mutations in a chromosome, i want to perform 1000 random "chromosomes" with the same length and same number of mutation but their positions are randomized.

here is what I'm currently running to generate these randomized mutation positions:

iterations=1000
Chr_size=1000000
num_mut=500
randbps=[]
for k in range(iterations):
    listed=np.random.choice(range(Chr_size),num_mut,replace=False)
    randbps.append(listed)

I want to do something similar to what they cover in this question

np.random.choice(range(Chr_size),size=(num_mut,iterations),replace=False)

however without replacement applies to the array as a whole.

further context: later in the script i go through each randomized chromosome and count the number of mutations in a given window:

for l in range(len(randbps)):

    arr=np.asarray(randbps[l])

    for i in range(chr_last_window[f])[::step]:
    
        counter=((i < arr) & (arr < i+window)).sum()

You don't want any of the `listed`s to be the same, meaning `randbps`'s elements should be unique? — ayhan, May 03 '16 at 19:00
This is caused by a long standing performance issue: https://github.com/numpy/numpy/issues/2764 — , May 05 '16 at 11:31

score 1 · Answer 1 · edited May 23 '17 at 12:23

1

Based on the trick used in this solution, here's an approach that uses argsort/argpartition on an array of random elements to simulate numpy.random.choice without replacement to give us randbps as a 2D array -

np.random.rand(iterations,Chr_size).argpartition(num_mut)[:,:num_mut]

Runtime test -

In [2]: def original_app(iterations,Chr_size,num_mut):
   ...:     randbps=[]
   ...:     for k in range(iterations):
   ...:         listed=np.random.choice(range(Chr_size),num_mut,replace=False)
   ...:         randbps.append(listed)
   ...:     return randbps
   ...: 

In [3]: # Input params (scaled down version of params listed in question)
   ...: iterations=100
   ...: Chr_size=100000
   ...: num=50
   ...: 

In [4]: %timeit original_app(iterations,Chr_size,num)
1 loops, best of 3: 1.53 s per loop

In [5]: %timeit np.random.rand(iterations,Chr_size).argpartition(num)[:,:num]
1 loops, best of 3: 424 ms per loop

edited May 23 '17 at 12:23

Community

1
1

answered May 03 '16 at 19:23

Divakar

218,885
19
262
358

1

this code works great in my script. Thanks for the great tip. – Luke Anderson- Trocme May 03 '16 at 20:32
1

when i up the number of iterations in my script to 1000 my computer crashes hard. I think it may be because the entire array is kept in memory. ~46Gb of memory requirements doesn't fit in my 16Gb laptop. – Luke Anderson- Trocme May 03 '16 at 21:45
@LukeAnderson-Trocme Yeah, really huge arrays would be a problem I think. – Divakar May 03 '16 at 21:47
1

yes, that makes sense. Thanks for the answer though! – Luke Anderson- Trocme May 09 '16 at 20:34

ayhan · Accepted Answer · 2016-05-03T20:05:22.240

I don't know how np.random.choice is implemented but I am guessing it is optimized for a general case. Your numbers, on the other hand, are not likely to produce the same sequences. Sets may be more efficient for this case, building from scratch:

import random

def gen_2d(iterations, Chr_size, num_mut):
    randbps = set()
    while len(randbps) < iterations:
        listed = set()
        while len(listed) < num_mut:
            listed.add(random.choice(range(Chr_size)))
        randbps.add(tuple(sorted(listed)))
    return np.array(list(randbps))

This function starts with an empty set, generates a single number in range(Chr_size) and adds that number to the set. Because of the properties of the sets it cannot add the same number again. It does the same thing for the randbps as well so each element of randbps is also unique.

For only one iteration of np.random.choice vs gen_2d:

iterations=1000
Chr_size=1000000
num_mut=500

%timeit np.random.choice(range(Chr_size),num_mut,replace=False)
10 loops, best of 3: 141 ms per loop

%timeit gen_2d(1, Chr_size, num_mut)
1000 loops, best of 3: 647 µs per loop

this one works faster than the other answer, and also doesn't take up as much memory. Much better! — Luke Anderson- Trocme, May 03 '16 at 22:06
I couldn't completely graps the idea behind @Divakar's answer but it's about the mutation rate really. If the mutation rate were 50% that answer would be 100x faster than mine. Both have their strengths and weaknesses. — ayhan, May 03 '16 at 22:31
There's also [`random.sample`](https://docs.python.org/2/library/random.html#random.sample) which can work directly with an `xrange` (in Python 2.x). — , May 05 '16 at 11:26

generate a 2D array of numpy.random.choice without replacement

2 Answers2

Linked