N, M = 1000, 4000000
a = np.random.uniform(0, 1, (N, M))
k = np.random.randint(0, N, (N, M))
out = np.zeros((N, M))
for i in range(N):
for j in range(M):
out[k[i, j], j] += a[i, j]
I work with very long for-loops; %%timeit
on above with pass
replacing the operation yields
1min 19s ± 663 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this is unacceptable in context (C++ took 6.5 sec). There's no reason for above to be done with Python objects; arrays have well-defined types. Implementing this in C/C++ as an extension is an overkill on both developer and user ends; I'm just passing arrays to loop and do arithmetic on.
Is there a way to tell Numpy "move this logic to C", or another library that can handle nested loops involving only arrays? I seek it for the general case, not workarounds for this specific example (but if you have one I can open a separate Q&A).