
I tried to model the voxels of a 3D cylinder with the following code:

import math
import numpy as np

R0 = 500
hz = 1

x = np.arange(-1000, 1000, 1)
y = np.arange(-1000, 1000, 1)
z = np.arange(-10, 10, 1)

xx, yy, zz = np.meshgrid(x, y, z)


def density_f(x, y, z):
    r_xy = math.sqrt(x ** 2 + y ** 2)
    if r_xy <= R0 and -hz <= z <= hz:
        return 1
    else:
        return 0


density = np.vectorize(density_f)(xx, yy, zz)

and it took many minutes to compute.

Equivalent suboptimal Java code runs in 10-15 seconds.

How can I make Python compute the voxels at the same speed? Where should I optimize?

Dims
  • Your bio lists MATLAB. Originally that was a wrapper for Fortran matrix code. To get good speed you had to think in terms of whole arrays. Newer MATLAB has a JIT compiler that lets you 'cheat' and write iterative code. `numpy` is more like the older MATLAB. To get around that you have to use add-on packages like `cython` and `numba` that create custom compiled code. I assume you know enough computer science to know the difference between C, Java, and Python. – hpaulj Oct 27 '18 at 00:30
  • The question of how to do fast vectorization comes up often. Here's a simpler recent example: https://stackoverflow.com/questions/53016316/how-to-efficiently-operate-a-large-numpy-array – hpaulj Oct 27 '18 at 04:23
  • Slightly unrelated to your question, but compare the square of the xy distance with the precomputed square of the radius – you can save a sqrt in each iteration, which should make a big difference. – Artur Biesiadowski Oct 29 '18 at 12:00
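To make the last comment concrete, here is a tiny sketch of the squared-distance trick (the sample coordinates are mine, chosen to straddle the boundary):

```python
import math

R0 = 500
r2 = R0 * R0  # square of the radius, precomputed once

def inside_sqrt(x, y):
    return math.sqrt(x * x + y * y) <= R0   # one sqrt per voxel

def inside_sq(x, y):
    return x * x + y * y <= r2              # no sqrt at all, same result

# (300, 400) lies exactly on the circle, (300, 401) just outside
print(inside_sq(300, 400), inside_sq(300, 401))  # True False
```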

3 Answers


Please do not use .vectorize(..); it is not efficient, since the processing still happens at the Python level. .vectorize() should only be used as a last resort, for example when the function cannot be calculated in "bulk" because its "structure" is too complex.

But you do not need to use .vectorize here; you can implement your function to work over arrays with:

r_xy = np.sqrt(xx ** 2 + yy ** 2)
density = (r_xy <= R0) & (-hz <= zz) & (zz <= hz)

or even a bit faster:

r_xy = xx * xx + yy * yy
density = (r_xy <= R0 * R0) & (-hz <= zz) & (zz <= hz)

This will construct a 2000×2000×20 array of booleans. We can use:

intdens = density.astype(int)

to construct an array of ints.

Printing the array here is quite cumbersome, but it contains a total of 2,356,047 ones:

>>> density.astype(int).sum()
2356047
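If memory is a concern, note that the full np.meshgrid arrays are not strictly needed here: the same boolean array can be built from the 1-D axes via broadcasting, and np.count_nonzero avoids casting to int just to sum. This is a sketch (the broadcasting layout is mine, not from the answer; it mirrors meshgrid's default 'xy' indexing):

```python
import numpy as np

R0, hz = 500, 1
x = np.arange(-1000, 1000, 1)
y = np.arange(-1000, 1000, 1)
z = np.arange(-10, 10, 1)

# Broadcast the 1-D axes instead of materializing three full 3-D arrays
r2 = y[:, None, None] ** 2 + x[None, :, None] ** 2   # shape (2000, 2000, 1)
density = (r2 <= R0 * R0) & (-hz <= z) & (z <= hz)   # broadcasts to (2000, 2000, 20)

# np.count_nonzero counts the True entries directly
print(np.count_nonzero(density))  # 2356047, the same count as above
```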

Benchmarks: If I run this locally 10 times, I get:

>>> timeit(f, number=10)
18.040479518999973
>>> timeit(f2, number=10)  # f2 is the optimized variant
13.287886952000008

So on average, we calculate this matrix (including casting it to ints) in 1.3-1.8 seconds.

Willem Van Onsem
  • What about more complex cases with `if`-s? Can I apply functions elementwise efficiently? – Dims Oct 26 '18 at 21:53
  • P.S. Why are you using single ampersands? – Dims Oct 26 '18 at 21:54
  • @Dims: because `and` in Python evaluates truthiness, and an array has no truthiness. `&` is the "elementwise" and of two matrices. It can require some "creativity" to write it in such a variant. This is something one "masters" over time with experience. – Willem Van Onsem Oct 26 '18 at 21:55
  • Thanks, but please explain: what to do with more complex functions, where there are if-s, exceptions, etc., which I can't write as a single vectorized expression? – Dims Oct 26 '18 at 22:01
  • @Dims: there are no ifs, etc.; you need to encode this in a logical way. The same goes for exceptions. It thus requires some "mathematical engineering". – Willem Van Onsem Oct 26 '18 at 22:04
  • Sorry, there are ifs in my case. And also nested calls of other functions, etc. And in many cases it is just not optimal to allocate all intermediate results as ND arrays – Dims Oct 26 '18 at 22:19
  • @Dims: well, numpy is designed to do things in "bulk", like calculating the elementwise sum of two arrays. Python, on the other hand, is very slow for individual operations, so I think then both will not work out. But I have seen that some problems that look quite complicated at first turn out to be element-wise operations, or convolutions, etc. Note that `if` statements are quite slow in general due to pipelining, so sometimes it is better to just calculate both operands instead of using `if`s – Willem Van Onsem Oct 26 '18 at 22:22
  • An `if` is not slow if it avoids computation of 90% of the space – Dims Oct 26 '18 at 23:16
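To make the comment thread concrete: one common way to encode an elementwise `if` without np.vectorize is np.where, which evaluates both branches over the whole array and then selects per element. A sketch with a made-up piecewise function:

```python
import numpy as np

v = np.array([-2.0, -1.0, 0.0, 1.0, 4.0])

# Piecewise: v**2 where v < 0, sqrt(v) otherwise.
# Both branches are computed in bulk; np.maximum guards sqrt
# against the negative entries that the other branch handles.
result = np.where(v < 0, v ** 2, np.sqrt(np.maximum(v, 0.0)))
print(result)  # [4. 1. 0. 1. 2.]
```

Note this does compute both branches everywhere, which is exactly the "calculate both operands" trade-off mentioned in the last comment by Willem Van Onsem.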

You can also use a compiled version of the function to calculate density; you can use cython or numba for that. I use numba to jit-compile the density calculation function in this answer, as it is as easy as adding a decorator.

Pros :

  • You can write `if` conditions, as you mention in your comments
  • Slightly faster than the numpy version mentioned in @Willem Van Onsem's answer, which has to iterate through the boolean array to calculate the sum in `density.astype(int).sum()`.

Cons:

  • Requires an ugly three-level loop; it loses the beauty of the one-liner numpy solution.

Code:

import numba as nb

@nb.jit(nopython=True, cache=True)
def calc_density(xx, yy, zz, R0, hz):
    threshold = R0 * R0
    dimensions = xx.shape

    density = 0
    for i in range(dimensions[0]):
        for j in range(dimensions[1]):
            for k in range(dimensions[2]):
                r_xy = xx[i, j, k] ** 2 + yy[i, j, k] ** 2
                if r_xy <= threshold and -hz <= zz[i, j, k] <= hz:
                    density += 1
    return density

Running times:

Willem Van Onsem's solution, f2 variant: 1.28s without the sum, 2.01s with the sum.

Numba solution (calc_density, on the second run, to discount compile time): 0.48s.

As suggested in the comments, we need not calculate the meshgrid at all; we can pass x, y, z directly to the function. Thus:

@nb.jit(nopython=True, cache=True)
def calc_density2(x, y, z, R0, hz):
    threshold = R0 * R0
    dimensions = len(x), len(y), len(z)

    density = 0
    for i in range(dimensions[0]):
        for j in range(dimensions[1]):
            for k in range(dimensions[2]):
                r_xy = x[i] ** 2 + y[j] ** 2
                if r_xy <= threshold and -hz <= z[k] <= hz:
                    density += 1
    return density

Now, for a fair comparison, we also include the time of np.meshgrid in @Willem Van Onsem's answer. Running times:

Willem Van Onsem's solution, f2 variant (np.meshgrid time included): 2.24s

Numba solution (calc_density2, on the second run, to discount compile time): 0.079s.

Deepak Saini
  • 1) Don't use global variables! Within Numba, every "global" variable is a compile-time constant. If you change a global variable and run your function again, Numba won't recognize the change, which leads to hard-to-debug errors. 2) A small change (passing x, y, z directly) leads to a factor of 10 better performance. 3) Parallelization would give another factor of the number of cores. – max9111 Oct 29 '18 at 13:58
  • @max9111 can you please explain or give an example of (2) and (3)? – Dims Oct 29 '18 at 14:16
  • @max9111, thanks. 1) Yes, indeed; slipped my mind. Edited to incorporate 2). Can't seem to get 3) to help; I guess the bookkeeping required for parallelization dwarfs the gains for this specific size. – Deepak Saini Oct 29 '18 at 14:42
  • @Deepak Saini I have posted quite a duplicate answer, apart from the parallelization. I think it would be better to add the parallelized approach to your answer and I'll delete mine afterwards... BTW: cache=True does not work if parallel=True is set. – max9111 Oct 29 '18 at 15:38
  • @max9111, sure. Will do. – Deepak Saini Oct 29 '18 at 15:46

This is meant as a lengthy comment on the answer of Deepak Saini.

The main change is to avoid the coordinates generated by np.meshgrid, which contain unnecessary repetitions. Using them isn't recommended if you can avoid it (both in terms of memory usage and performance).

Code

import numba as nb
import numpy as np

@nb.jit(nopython=True, parallel=True)
def calc_density_2(x, y, z, R0, hz):
    threshold = R0 * R0

    density = 0
    for i in nb.prange(y.shape[0]):
        for j in range(x.shape[0]):
            r_xy = x[j] ** 2 + y[i] ** 2
            for k in range(z.shape[0]):
                if r_xy <= threshold and -hz <= z[k] <= hz:
                    density += 1

    return density

Timings

R0 = 500
hz = 1

x = np.arange(-1000, 1000, 1)
y = np.arange(-1000, 1000, 1)
z = np.arange(-10, 10, 1)

xx, yy, zz = np.meshgrid(x, y, z)

# after the first call (to exclude compilation overhead):
# calc_density_2           9.7 ms
# calc_density_2 parallel  3.9 ms
# @Deepak Saini            115 ms
max9111