import numpy as np

N, M = 1000, 4000000
a = np.random.uniform(0, 1, (N, M))
k = np.random.randint(0, N, (N, M))

out = np.zeros((N, M))
for i in range(N):
    for j in range(M):
        out[k[i, j], j] += a[i, j]

I work with very long for-loops; `%%timeit` on the above (with the `+=` line replaced by `pass`) yields

1min 19s ± 663 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is unacceptable in context (a C++ implementation took 6.5 sec). There's no reason for the above to run on Python objects; the arrays have well-defined types. Implementing this as a C/C++ extension is overkill on both the developer and user ends; all I'm doing is looping over arrays and doing arithmetic on them.

Is there a way to tell NumPy "move this logic to C", or another library that can handle nested loops involving only arrays? I'm after the general case, not a workaround for this specific example (though if you have one, I can open a separate Q&A).
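(For completeness, even though case-specific workarounds are out of scope for the question: the loop above is a scatter-add, which NumPy can express directly with the unbuffered `np.add.at`. A minimal sketch with smaller sizes; `np.add.at` handles repeated indices correctly, though it is often not dramatically faster than compiled code would be.)

```python
import numpy as np

N, M = 100, 400                       # smaller sizes for a quick check
a = np.random.uniform(0, 1, (N, M))
k = np.random.randint(0, N, (N, M))

# column indices broadcast to the same (N, M) shape as k
j = np.broadcast_to(np.arange(M), (N, M))

out = np.zeros((N, M))
np.add.at(out, (k, j), a)             # unbuffered: repeated (k, j) pairs accumulate
```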

OverLordGoldDragon
    Are you looking for compilers for quasi-Python as in Cython, numba, etc. or in other approaches, as in vectorised numpy operations? – MisterMiyagi Oct 26 '20 at 16:11
  • @MisterMiyagi I'm looking for anything that lets me keep code in the same `.py` I'm working on instead of forcing creation of a separate file - but worst-case if I am to create it, it should be in Python and not another language. See [this answer's](https://stackoverflow.com/a/64526158/10133797) `ne.evaluate` for example. – OverLordGoldDragon Oct 26 '20 at 16:13
    @MisterMiyagi Unsure what you mean by "compilers", anyone with the standard C implementation of Python should be able to run it, and extra libraries should handle their own dependencies (including compilers if needed). -- and no vectorization etc, this isn't an algorithm optimization question. – OverLordGoldDragon Oct 26 '20 at 16:17
    @MisterMiyagi: Cython would seem to be ruled out by not wanting their own extension, but `numba` is a possibility, assuming conversion to vectorized `numpy` operations isn't feasible. – ShadowRanger Oct 26 '20 at 16:19
  • @ShadowRanger Cython can do ximports and generally handles ``.py`` files as well. I'm just not sure what are the requirements. Currently, the question reads awfully broad and "recommend me *something*"'ish. – MisterMiyagi Oct 26 '20 at 16:22
  • @ShadowRanger I'm fine with a "wrapper" that makes a separate file or whatever it needs, for example TensorFlow has `@tf.function` that converts Python code into a C graph, but it's for GPU computing and total overkill here. – OverLordGoldDragon Oct 26 '20 at 16:23
    @OverLordGoldDragon: For something along those lines, [`@numba.jit(nopython=True)` would be the first thing I'd think of](https://numba.readthedocs.io/en/stable/user/jit.html). Whether it will fully optimize your case, I can't say, but it's worth a shot (it's by far the simplest tweak). I'll note, your code as rendered is not in a function, which will make standard CPython slower (just wrapping it in a function would change every read/write to your variables from a `dict` lookup to a C array indexing operation). – ShadowRanger Oct 26 '20 at 16:27
  • @ShadowRanger Seems perfect, I'll give it a shot. – OverLordGoldDragon Oct 26 '20 at 16:30
    @OverLordGoldDragon: I tested it on a local machine. Without wrapping in a function (but reducing `M` to 40000), it took about 29.1 seconds user time; wrapping it in a function dropped it to 25.5 seconds (small, but meaningful change), and decorating that function with `@numba.jit(nopython=True)` dropped it to 2.5 seconds (though the first time it ran the wall clock time was ~12.4 seconds, with the second run dropping to 3.6; loading `numba` itself and `jit`ing has some non-trivial startup costs, especially if, as in my case, the library has to be cached from NFS the first time). – ShadowRanger Oct 26 '20 at 16:39
  • @ShadowRanger Meanwhile, [my results](https://i.stack.imgur.com/qRxha.png); completely smashed for-loop overhead. I wonder what one has to gain with Numba vs explicitly moving to C, guess I'll look into it - but I'd be surprised Numba can't beat naive C implementations (no tricks etc). – OverLordGoldDragon Oct 26 '20 at 16:42
  • @ShadowRanger Are the 'startup costs' just at import, or each time some new functionality is called the first time (and then it's cached somewhere)? Thanks ahead – OverLordGoldDragon Oct 26 '20 at 16:47
    @OverLordGoldDragon: Ah, my test still incorporated your actual work (it wasn't just a `pass`), which, ideally, you'd find a way to vectorize so the Python level loops aren't needed in the first place, but even so, yes, the end result is much improved by `numba`. The startup costs are two-fold: 1) Actually loading all the dependencies from disk (paid first time you use `numba` after boot, and again any time you don't use it for a long while and it drops from disk cache; it's a large library), and 2) `jit`ing the function the first time it's called (cached for run, but regenerated on new runs). – ShadowRanger Oct 26 '20 at 16:47
  • @OverLordGoldDragon: In my case, cost #1 (paid rarely) was ~9 seconds (like I said, it was on an NFS mount, so loading into cache was unusually slow); cost #2 (paid the first time a `jit`ed function is called, with the results cached for when its called again in the same run of the program) seemed to be ~1 second. – ShadowRanger Oct 26 '20 at 16:51
  • @ShadowRanger pfft yeah that's nothing compared to savings, still good to know - thanks for sharing. Maybe it's time I looked in one of those "top X useful py libraries" instead of hammering everything myself. – OverLordGoldDragon Oct 26 '20 at 16:53
    The first time you call a `njit` decorated function there is a small overhead for the time needed for the jit compilation. This is in the order of ms usually for normal cases. In principle you can also compile them before execution, but that's not that straightforward. Also note, once the kernel compiles a function, and you are in an interactive session, modifying the function won't force recompilation, you need to clear the cache: https://stackoverflow.com/questions/44131691/how-to-clear-cache-or-force-recompilation-in-numba – dzang Oct 26 '20 at 18:31
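(ShadowRanger's point above, that merely wrapping the code in a function speeds up CPython because local variables become C array indexing instead of dict lookups, is visible directly in the bytecode. A minimal sketch using the standard `dis` module; the function names are illustrative.)

```python
import dis

x = 0

def module_level_style():
    # with `global`, reads/writes of x go through the module dict:
    # LOAD_GLOBAL / STORE_GLOBAL bytecodes (hash lookups)
    global x
    x = x + 1

def function_local_style():
    # local names compile to LOAD_FAST / STORE_FAST, which index
    # a fixed C array of slots instead of hashing into a dict
    y = 0
    y = y + 1
    return y

dis.dis(module_level_style)
dis.dis(function_local_style)
```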

1 Answer


This is basically the idea behind Numba: not as fast as C, but it can get close. It uses a JIT compiler to translate Python code to machine code, and it's compatible with most NumPy functions. (You'll find all the details in the docs.)

import numpy as np
from numba import njit


@njit
def f(N, M):
    a = np.random.uniform(0, 1, (N, M))
    k = np.random.randint(0, N, (N, M))

    out = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            out[k[i, j], j] += a[i, j]
    return out


def f_python(N, M):
    a = np.random.uniform(0, 1, (N, M))
    k = np.random.randint(0, N, (N, M))

    out = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            out[k[i, j], j] += a[i, j]
    return out

Pure Python:

%%timeit

N, M = 100, 4000
f_python(N, M)

338 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With Numba:

%%timeit

N, M = 100, 4000
f(N, M)

12 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
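(On the per-run compilation cost discussed in the comments: `njit` accepts `cache=True`, which writes the compiled machine code to disk so cost #2, the per-process JIT compilation, is paid only once per code change rather than on every run. A minimal sketch; the `ImportError` fallback is only so the snippet also runs where numba isn't installed.)

```python
import numpy as np

try:
    from numba import njit
except ImportError:                    # fall back to pure Python without numba
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

@njit(cache=True)                      # cached binary reused across program runs
def scatter_add(a, k):
    N, M = a.shape
    out = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            out[k[i, j], j] += a[i, j]
    return out
```

Note the first call in a fresh process still compiles (or loads the cached binary), so warm the function up on a small input before timing it.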

dzang
    Excellent. Plus, your example underpins the [smashing](https://i.stack.imgur.com/qRxha.png) of Python's for-loop overhead. I'll keep the question open a little longer, but doubt anything will beat this. – OverLordGoldDragon Oct 26 '20 at 16:44
    @OverLordGoldDragon Do you actually believe that it did 400 million iterations in 451 ns? – Kelly Bundy Nov 03 '20 at 01:13
  • @HeapOverflow I realized shortly after that it's absurd; to test, I passed an array with `+=1` into the loop to make sure the loop wasn't being skipped, and execution time barely budged - so presumably some other optimization was done, but I didn't pursue it further. -- For others, Heap is referring to the 4e8/5e-7 = 1e15 Hz CPU (one core) that would be required. – OverLordGoldDragon Nov 03 '20 at 01:52