
Doing a few tests, I noticed that a custom loop function I wrote could be faster than NumPy's built-in function np.cumprod(). I didn't think that was possible; could someone explain what is happening?

import time
import numpy as np

def custom_cumprod(x):
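    # iterating a 2D array yields its rows, so this accumulates the running
    # product row by row (i.e. along axis 0), always working on contiguous memory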
    xb = [1]
    for xi in x:
        xb.append(xb[-1]*xi)
    return np.array(xb[1:])

x = np.random.random((10000, 10000))

t0 = time.time()
b = custom_cumprod(x)
t1 = time.time()
print(t1-t0)
t0 = time.time()
c = np.cumprod(x,0)
t1 = time.time()
print(t1-t0)
print(np.sum(c-b))

Thanks!

  • Probably NumPy picking a less efficient loop order. – user2357112 Feb 07 '22 at 11:48
  • You are multiplying on an axis where memory is not [contiguous](https://stackoverflow.com/questions/26998223/what-is-the-difference-between-contiguous-and-non-contiguous-arrays). Compare to `np.cumprod(x.T, 0)` (see the sketch after these comments). – Michael Szczesny Feb 07 '22 at 11:51
  • I checked your test and found that the internal cumprod uses significantly less memory. I guess a more memory-conservative approach was chosen, but at the cost of speed. – Askold Ilvento Feb 07 '22 at 11:54
  • I asked a similar question in the past. In short, (1) `np.cumprod(x, 1)` is going to be faster due to memory layout, and (2) your data is so big that you are basically testing the memory efficiency of your PC; with 100x100, NumPy should be faster. –  Feb 07 '22 at 11:54
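
To make the layout point from the comments concrete, you can inspect the strides of `x` and of its transposed view, and time the call whose accumulation axis is contiguous. A minimal sketch (absolute timings will vary by machine, and note that `np.cumprod(x.T, 0)` computes a different result; it only illustrates the memory-layout effect):

import time
import numpy as np

x = np.random.random((10000, 10000))

# C order: rows are contiguous, so axis 1 has the small stride.
print(x.strides, x.flags['C_CONTIGUOUS'])      # (80000, 8) True
# x.T is just a view on the same memory with the strides swapped.
print(x.T.strides, x.T.flags['F_CONTIGUOUS'])  # (8, 80000) True

t0 = time.time()
np.cumprod(x, 0)    # accumulates along the strided (outer) axis -> slow
print('x, axis 0:  ', time.time() - t0)

t0 = time.time()
np.cumprod(x.T, 0)  # axis 0 of the view is contiguous -> fast, but a different result
print('x.T, axis 0:', time.time() - t0)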

1 Answer


The issue is that the accumulation runs along the outer dimension, and NumPy does not seem to optimize this case. When looping along that dimension, consecutive elements are not adjacent in memory, which results in cache and TLB misses.

This can be fixed by storing the array in Fortran order (np.asfortranarray) so that axis 0 becomes the contiguous dimension, or by running the computation over a transposed copy along the inner dimension and transposing the result back.
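
A minimal sketch of both workarounds (keep in mind that np.asfortranarray and x.T.copy() each make a full copy of the array, which has its own cost but is paid only once):

import time
import numpy as np

x = np.random.random((10000, 10000))

# Option 1: copy into Fortran order so that axis 0 becomes the contiguous axis.
t0 = time.time()
a = np.cumprod(np.asfortranarray(x), axis=0)
print('Fortran order, axis 0: ', time.time() - t0)

# Option 2: accumulate along the inner axis of a transposed copy,
# then transpose the result back (the final .T is a free view).
t0 = time.time()
b = np.cumprod(x.T.copy(), axis=1).T
print('transposed copy, axis 1:', time.time() - t0)

print(np.allclose(a, b))  # both equal np.cumprod(x, axis=0)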

perf shows the difference clearly:

perf stat -e LLC-load-misses,dTLB-load-misses python3 -c \
    'import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=0)'

 Performance counter stats for 'python3 -c import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=0)':

        22.254.639      LLC-load-misses:u                                           
       114.772.180      dTLB-load-misses:u                                          

       3,200727804 seconds time elapsed

       2,148422000 seconds user
       1,044664000 seconds sys

perf stat -e LLC-load-misses,dTLB-load-misses python3 -c \
    'import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=1)'

 Performance counter stats for 'python3 -c import numpy as np; np.cumprod(np.random.random((10000, 10000)), axis=1)':

           633.959      LLC-load-misses:u                                           
         2.485.725      dTLB-load-misses:u                                          

       2,276527365 seconds time elapsed

       0,947946000 seconds user
       1,316536000 seconds sys

Ideally, NumPy would run the computation the same way your custom version does.
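
For reference, that loop order looks roughly like this: an outer Python loop over the accumulation axis, with each step multiplying whole contiguous rows into a preallocated output (the function name here is just for illustration):

import numpy as np

def cumprod_axis0(x):
    # Outer loop over axis 0; each step multiplies two contiguous rows,
    # so memory is streamed in order instead of jumping between rows.
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, x.shape[0]):
        np.multiply(out[i - 1], x[i], out=out[i])
    return out

x = np.random.random((1000, 1000))
assert np.allclose(cumprod_axis0(x), np.cumprod(x, axis=0))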

Homer512