
I have some simple code as follows.

import numpy as np

count = 0
iters = 1000000
l = 10
k = 10
for i in xrange(iters):
    # draw random +/-1 sequences of length l+k-1 and l
    t = np.random.choice([-1, 1], size=l+k-1)
    v = np.random.choice([-1, 1], size=l)
    # count the trials in which every output of the 'valid' convolution is zero
    if not np.convolve(v, t, 'valid').any():
        count += 1

print count

When I profiled this I was surprised to see that only a small fraction of the time is spent in np.convolve, which in theory should be the most expensive part. Is there any way to speed up the other parts so that np.convolve takes the majority of the time?

I would happily use PyPy but sadly it doesn't support np.convolve yet.


Is there any way to make it fast enough that iters = 100 million with l = 12, k = 12 can terminate?

Simd
  • Aside: `v` is made up of -1 and 1. When will `np.all(v==0)` ever be true? (Assuming `l != 0`, of course, but then the convolution would fail.) – DSM Apr 02 '14 at 20:15
  • I must be missing something obvious. If `v = np.array([-1, 1])`, then `(v == 0)` == `array([False, False], dtype=bool)`, and `np.all(v == 0) == False`. – DSM Apr 02 '14 at 20:18
  • @felix, But `np.random.choice([-1, 1])` only selects either -1 or 1, so `np.all(v == 0)` is always False. – wflynny Apr 02 '14 at 20:19
  • Oh sorry I am misunderstanding the question! It's a typo. Let me fix it. I thought you were asking when `not np.convolve(v, t, 'valid').any()` could be true. – Simd Apr 02 '14 at 20:19
  • 1
    You could try `tv = np.random.choice([-1,1], size = l+k-1+l); if (not np.convolve(tv[:l], tv[l:], 'valid').any()))` to see if that helps any. – DSM Apr 02 '14 at 20:27
  • 1
    `(np.random.random(l+k-1+l) > 0.5)*2 - 1` should generate the random array slightly faster. – M4rtini Apr 02 '14 at 20:36
  • FYI I've updated my answer for multi processing: this brings the simulation time down by another factor depending on how many cores you have available. – TooTone Apr 03 '14 at 14:21
  • @TooTone I am just trying it out now! Thank you. What is the magic number `N=13`? – Simd Apr 04 '14 at 08:41
  • @felix the program uses 8 processes, so each process has to run 13 million trials to get 8*13 = 104 million trials altogether. – TooTone Apr 04 '14 at 10:16

2 Answers


Edit: using a combination of generating the random numbers in a block and multiprocessing, I got 100 million trials done in 6 minutes on my laptop, altogether about 12 times faster than the original code.

single process with block generation of random numbers

The following code runs 3x faster on my machine (15s vs 45s). The main change is to move all random number generation out of the main loop. If iters is so large that you don't have the memory to do that, then you can run a nested loop and generate as large a block as you can afford and rinse and repeat -- I've put the code for this below following the edit to your question.

import numpy as np

count = 0
iters = 1000000

l = 10
k = 10
l0 = l + k - 1

# generate all the random numbers for the whole run up front, in two flat blocks
t = np.random.choice([-1, 1], size=l0 * iters)
v = np.random.choice([-1, 1], size=l * iters)

for i in xrange(iters):
    # slice the i-th trial's sequences out of the pre-generated blocks
    if not np.convolve(v[(l*i):(l*(i+1))], t[(l0*i):(l0*(i+1))], 'valid').any():
        count += 1

print count

The other minor change, which improved the performance of your original code by about 2%, was to move the calculation of l+k-1 outside of the loop. Incidentally, I found that the way you deal with count is quite efficient. So, for example, count += np.convolve(v[(l*i):(l*(i+1))], t[(l0*i):(l0*(i+1))], 'valid').any() and then doing iters - count at the end is slower. This is because the condition not...any() is very rare, so the way you have it now touches count very rarely.

To run 100 million times set N=100 in the program below. With the current value of N=4 the program took 1 minute to execute (with a count of 26). With l=12, k=12, and N=4 the program took just over a minute to execute (with a count of 4). So you should be looking at less than half an hour for 100 million.

import numpy as np, time

start = time.clock()

count = 0
iters = 1000000  # 1 million trials per block

l = 10
k = 10
l0 = l + k - 1

N = 4  # number of blocks of a million trials

for n in range(N):
    # generate a block of a million trials' worth of random numbers at a time
    t = np.random.choice([-1, 1], size=l0 * iters)
    v = np.random.choice([-1, 1], size=l * iters)

    for i in xrange(iters):
        if not np.convolve(v[(l*i):(l*(i+1))], t[(l0*i):(l0*(i+1))], 'valid').any():
            count += 1

print (time.clock() - start)

print count

multiple processes

Edit: this is an "embarrassingly parallel" problem, so you can run the simulations on multiple processors. An easy way to do this is using a pool of workers. Note, however, that it's important to set the random seed in each process. Otherwise you risk having each process use the same set of random numbers (see here; I'm assuming this applies to numpy random as well as to the base random). The code is below.

import numpy as np, time
from multiprocessing import Pool

def countconvolve(N):
    np.random.seed() # ensure seed is random

    count = 0 
    iters = 1000000 # 1million

    l = 12
    k = 12
    l0 = l + k - 1

    for n in range(N):
        t = np.random.choice([-1,1], size=l0 * iters)
        v = np.random.choice([-1,1], size = l * iters)

        for i in xrange(iters):
            if (not np.convolve(v[(l*i):(l*(i+1))], t[(l0*i):(l0*(i+1))], 'valid').any()):
                count += 1

    return count

if __name__ == '__main__':
    start = time.clock()

    num_processes = 8
    N = 13

    pool = Pool(processes=num_processes)
    res = pool.map(countconvolve, [N] * num_processes)    
    print res, sum(res)

    print (time.clock() - start)

It ran 104 million simulations in 370 seconds, and produced the following output

[4, 9, 10, 6, 7, 8, 11, 9] 64

My laptop is a core-i7 with 8GB of memory, so with 8 cores I got a 4x speedup by parallel processing. I found the memory usage was about 160MB per process (with a higher peak). If you have fewer cores or less memory you would want to adjust the parameters in the program accordingly.

With @moarningsun's suggestion of constraining the array to have 1 byte per element, the memory usage dropped to 60MB per process.

        t = np.random.choice(np.array([-1,1], dtype=np.int8), size=l0 * iters)
        v = np.random.choice(np.array([-1,1], dtype=np.int8), size = l * iters)
TooTone
  • I believe you can reduce memory use by taking the random choice from a `np.array([-1, 1], dtype=np.int8)`. Also, you could reshape `t` and `v` to let numpy handle the indexing maths, but I don't know if this is significant. – moarningsun Apr 03 '14 at 14:24
  • By the way, your laptop has only 4 physical cores (and 4 "virtual" cores supplied by [hyper-threading](http://en.wikipedia.org/wiki/Hyper-threading)), so the scaling with multiprocessing is actually very good. – moarningsun Apr 03 '14 at 14:34
  • @moarningsun thanks a lot. The memory usage tip worked very well. The `reshape` not so much, which is a pity because I thought it was a really sweet idea and it simplified the code. I also tried directly setting the `size` to a 2-tuple in `np.random.choice` but this didn't help either (in fact, the original way was fastest, which I find odd). – TooTone Apr 03 '14 at 15:25
  • @moarningsun By the way re hyper-threading, you're right, the extra virtualized cores do make a difference but not a big difference. – TooTone Apr 03 '14 at 15:43
  • Thank you for this. Sadly it has killed my computer several times due to RAM problems when I try large iters. Is there some way to limit the RAM usage? – Simd Apr 04 '14 at 09:38
  • @felix oh dear. The total number of trials is `num_processes*N*iters`. So to limit the RAM usage, decrease `iters` and increase `N` by the same amount. Similarly, to reduce the number of processes, decrease `num_processes` and increase `N`. You could try `iters = 100000` and `num_processes=4`, which will reduce memory consumption by a factor of 20. Then you need to set N to 100,000,000/100,000/4, i.e. `N=250`, to get 100 million trials altogether. I'm also assuming you used the last 2 lines of code from moarningsun's suggestion, which made a BIG difference to memory usage. – TooTone Apr 04 '14 at 10:21
  • Out of interest, given that this code is 100% cpu bound, why don't you get close to an 8x speedup? I seem to get a 5x speedup on my 8 core AMD processor. – Simd Apr 06 '14 at 21:36
  • @felix Does that mean you've got the code to work on your machine now? That's good news if so :). There are several reasons why you wouldn't get perfect speedup. E.g. on my machine, there is hyperthreading, which means that the 8 cores are only really 4 cores with hardware support on each core for threading. More generally, there is an overhead to setting up the multiple threads, and even once it's going there will be competition for resources such as memory, etc. NB I don't think the code is 100% cpu bound, there will almost certainly be significant delays in memory access. It's complicated! – TooTone Apr 06 '14 at 21:40
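For readers curious about the `reshape` idea discussed in the comments above, here is a rough sketch of how `countconvolve` might look with the random blocks held as 2-D arrays. This is an illustration of the idea only, not code from the thread (the name `countconvolve_reshaped` is just for the sketch); as noted above it simplifies the indexing but was not measured to be faster.

import numpy as np

def countconvolve_reshaped(N):
    # same counting as countconvolve above, but the random blocks are 2-D so
    # that row indexing replaces the slice arithmetic
    np.random.seed()   # ensure each worker process seeds itself
    count = 0
    iters = 1000000    # one million trials per block

    l = 12
    k = 12
    l0 = l + k - 1

    for n in range(N):
        # int8 keeps memory usage down, as suggested in the comments
        t = np.random.choice(np.array([-1, 1], dtype=np.int8), size=(iters, l0))
        v = np.random.choice(np.array([-1, 1], dtype=np.int8), size=(iters, l))
        for i in xrange(iters):
            if not np.convolve(v[i], t[i], 'valid').any():
                count += 1

    return count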

As @DSM and @M4rtini are implying, np.random.choice is the bottleneck (although 1e6 iterations with l=k=10 takes about 19 seconds on my machine). Here are some timeit results which compare the methods

# Current
In [20]: %timeit import numpy as np; t = np.random.choice([-1, 1], size=21); v = np.random.choice([-1, 1], size=10); c = np.convolve(v, t, 'valid');
100000 loops, best of 3: 19.2 us per loop
# @DSM
In [25]: %timeit import numpy as np; tv = np.random.choice([-1, 1], size=31); c = np.convolve(tv[:10], tv[10:], 'valid');
100000 loops, best of 3: 12.4 us per loop
# @M4rtini
In [29]: %timeit import numpy as np; tv = (np.random.random(31) > 0.5)*2-1; c = np.convolve(tv[:10], tv[10:], 'valid');
100000 loops, best of 3: 8.18 us per loop

What values of k, l, and iters are you using in practice? If they aren't much larger than your example, these suggestions are more than sufficient I think. I'll add additional timeits as they come along.
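For reference, a minimal sketch of dropping the fastest of these, @M4rtini's boolean trick combined with @DSM's single draw, into the question's original loop could look like this (an illustration only; the variable names and values follow the question):

import numpy as np

count = 0
iters = 1000000
l = 10
k = 10

for i in xrange(iters):
    # one draw of l+k-1+l random values per trial, split into the two sequences
    tv = (np.random.random(l + k - 1 + l) > 0.5) * 2 - 1
    if not np.convolve(tv[:l], tv[l:], 'valid').any():
        count += 1

print count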

wflynny
  • I would really like to set iters to 100 million with l = k = 12. This is just too slow on my machine. – Simd Apr 02 '14 at 21:15