
I am running a simulation in which two methods are called every iteration, and together they take up most of the computation time.

The methods are:

def calculate_macrovars(f, n, ns, ex, ey, dens, velx, vely):
    # view of the n x n interior of the padded distribution array
    fd = f[:,2:n+2,2:n+2]
    # accumulate density and momentum over the ns lattice directions
    dens[:,:] = fd[0,:,:]
    velx[:,:] = 0.
    vely[:,:] = 0.
    for s in range(1,ns):
        fds = fd[s,:,:]
        dens += fds
        velx += ex[s]*fds
        vely += ey[s]*fds
    # convert momentum to velocity
    velx[:,:] /= dens
    vely[:,:] /= dens

and

def do_relaxation(f, ftmp, n, ns, ws, ex, ey, cssq, omega, dens, velx, vely):
    # views of the n x n interior of the padded arrays
    dist = f[:,2:n+2,2:n+2]
    dist_tmp = ftmp[:,2:n+2,2:n+2]
    # squared velocity magnitude scaled by the squared sound speed
    vv = (velx*velx + vely*vely)/cssq
    for s in range(ns):
        ev = (ex[s]*velx + ey[s]*vely)/cssq
        # equilibrium distribution for direction s
        dist_eq = ws[s]*dens*(1 + ev + 0.5*ev*ev - 0.5*vv)
        # relax towards equilibrium with relaxation parameter omega
        dist_tmp[s,:,:] = (1. - omega)*dist[s,:,:] + omega*dist_eq

The arguments to these functions are:

- `f`, `ftmp`: 3D arrays, shape `ns x n x n`, dtype `np.float32`
- `n`, `ns`: integer scalars
- `ws`: 1D array of length `ns`, dtype `np.float32`
- `ex`, `ey`: 1D arrays of length `ns`, dtype `np.int8`
- `cssq`, `omega`: float scalars
- `dens`, `velx`, `vely`: 2D arrays, shape `n x n`, dtype `np.float32`
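For concreteness, a minimal allocation consistent with these shapes might look like the sketch below. It assumes n = 1000 and ns = 9 (the actual sizes given in the comments further down) and, since the functions slice `f[:,2:n+2,2:n+2]`, that `f` and `ftmp` carry a two-cell padding layer around the `n x n` interior; the lattice vectors, weights, and scalars are only placeholders:

import numpy as np

n, ns = 1000, 9                      # actual sizes, taken from the comments below

# distribution arrays, with an assumed two-cell padding layer on each side
f    = np.zeros((ns, n + 4, n + 4), dtype=np.float32)
ftmp = np.zeros_like(f)

# per-direction lattice vectors and weights (placeholder values)
ex = np.zeros(ns, dtype=np.int8)
ey = np.zeros(ns, dtype=np.int8)
ws = np.zeros(ns, dtype=np.float32)

cssq, omega = 0.333, 1.0             # placeholder scalars

# macroscopic fields
dens = np.zeros((n, n), dtype=np.float32)
velx = np.zeros((n, n), dtype=np.float32)
vely = np.zeros((n, n), dtype=np.float32)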

I profiled a simulation run and found that these two methods take up 25% and 69% of the time respectively, for a total of 94% of the computation time, as shown below (I have trimmed the profiles to remove all lines that take negligible time to run):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def tgv_simulation():
    84      4981     16577491   3328.1     24.9          calculate_macrovars()
    89      4981     45899937   9215.0     68.9          do_relaxation()
    91      4981      3614229    725.6      5.4          do_streaming()

Total time: 66.6111 s

Profiling calculate_macrovars() yields:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def calculate_macrovars():
     5      4981        16353      3.3      0.1      fd = f[:,2:n+2,2:n+2]
     6      4981       376653     75.6      2.2      dens[:,:] = fd[0,:,:]
     7      4981       256929     51.6      1.5      velx[:,:] = 0.
     8      4981       246298     49.4      1.4      vely[:,:] = 0.
     9     44829        41229      0.9      0.2      for s in range(1,ns):
    10     39848        47308      1.2      0.3          fds = fd[s,:,:]
    11     39848      3739018     93.8     21.6          dens += fds
    12     39848      5699723    143.0     32.9          velx += ex[s]*fds
    13     39848      5428619    136.2     31.4          vely += ey[s]*fds        
    15      4981       723137    145.2      4.2      velx[:,:] /= dens 
    16      4981       719715    144.5      4.2      vely[:,:] /= dens

Total time: 17.3031 s

Profiling do_relaxation() yields:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    19                                           def do_relaxation():
    22      4981        17183      3.4      0.0      dist = f[:,2:n+2,2:n+2]
    23      4981         6700      1.3      0.0      dist_tmp = ftmp[:,2:n+2,2:n+2]
    24      4981      1510181    303.2      3.3      vv = (velx*velx + vely*vely)/cssq
    25     49810        61256      1.2      0.1      for s in range(ns):
    26     44829     12988964    289.7     28.7          ev = (ex[s]*velx + ey[s]*vely)/cssq
    27     44829     18644169    415.9     41.2          dist_eq = ws[s]*dens*(1 + ev + 0.5*ev*ev - 0.5*vv)
    28     44829     11993650    267.5     26.5          dist_tmp[s,:,:] = (1. - omega)*dist[s,:,:] + omega*dist_eq

Total time: 45.2221 s

What sort of optimization strategies should I employ to improve the performance of these two methods?

Update: Vectorizing calculate_macrovars:

def calculate_macrovars(f, n, ns, ex, ey, dens, velx, vely):
    fd = f[:,2:n+2,2:n+2]
    # sum over the direction axis; ex/ey are broadcast against fd
    dens[:,:] = numpy.sum(fd, axis=0)
    velx[:,:] = numpy.sum(fd*ex[:, None, None], axis=0)/dens
    vely[:,:] = numpy.sum(fd*ey[:, None, None], axis=0)/dens

Profiling this version yields:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def calculate_macrovars(f, n, ns, ex, ey, dens, velx, vely):
     5      4981        15644      3.1      0.1      fd = f[:,2:n+2,2:n+2]
    16      4981      4232570    849.7     23.9      dens[:,:] = numpy.sum(fd, axis=0)
    17      4981      6887395   1382.7     38.9      velx[:,:] = numpy.sum(fd*ex[:, None, None], axis=0)/dens
    18      4981      6557733   1316.5     37.1      vely[:,:] = numpy.sum(fd*ey[:, None, None], axis=0)/dens

Total time: 17.6933 s

Approximately the same performance as the looped version of calculate_macrovars.
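
For reference, the `np.einsum` form that hpaulj suggests in the comments below would look roughly like this. It is only a sketch (renamed here to keep it apart from the versions above), and given hpaulj's point that the 9-iteration loop is not the bottleneck, it is not expected to be much faster:

import numpy

def calculate_macrovars_einsum(f, n, ns, ex, ey, dens, velx, vely):
    fd = f[:,2:n+2,2:n+2]
    # single-call sum over the direction axis
    dens[:,:] = fd.sum(axis=0)
    # einsum contracts the direction axis of fd against ex/ey
    velx[:,:] = numpy.einsum('i...,i->...', fd, ex)/dens
    vely[:,:] = numpy.einsum('i...,i->...', fd, ey)/dens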

  • Have you had a look at [Multiprocessing](https://docs.python.org/2/library/multiprocessing.html)? – DrBwts Feb 03 '17 at 12:45
  • @DrBwts - I have considered it, and also pyCUDA, but I was unsure how to apply it to just these two methods, and decided to see whether it was possible to squeeze some more performance out of the sequential implementation first. I'd prefer not to rewrite the rest of the code. – nluigi Feb 03 '17 at 12:53
  • Are the `s` loops inherently iterative, or have you just not figured out how to write them in vector form? `dens=fd.sum(axis=0)` etc. – hpaulj Feb 03 '17 at 12:53
  • @hpaulj - indeed the sum for `dens` can be vectorized, but for `velx` and `vely` I have not found a way to vectorize; this is because of the chosen memory layout of `fd`, which doesn't allow an `np.dot` operation on `ex` and `fd`. Only if it is changed to shape `nxnxns` is it possible to vectorize `velx` and `vely`, but then the memory access to `fd` becomes inefficient, basically undoing the performance gain. – nluigi Feb 03 '17 at 13:00
  • Ultimately, for iterations over large data sets, parallel is the way to go. It's worth the pain if you are doing this kind of thing a lot. – DrBwts Feb 03 '17 at 13:00
  • `(fd*ex[:, None, None]).sum(axis=0)`? `np.einsum('i...,i', fd, ex)`? – hpaulj Feb 03 '17 at 13:06
  • @hpaulj - OK, that is interesting; I guess it is possible to vectorize after all. Unfortunately the vectorized code is approximately 2x slower than the looped version given in the question. – nluigi Feb 03 '17 at 13:18
  • @hpaulj - OK, I have updated the question with the vectorized code; it turns out to perform on par with the looped code (not 2x slower). – nluigi Feb 03 '17 at 13:31
  • If vectorization does not speed up your simulations, you can try the opposite strategy: implement everything with explicit loops and use JIT compilation via @numba.jit (it may be important in this case to specify the argument types explicitly). As I remember, it helps in some cases. – Roman Fursenko Feb 03 '17 at 14:55
  • What are the shapes of `f` and `ex`? – hpaulj Feb 03 '17 at 15:53
  • @hpaulj - It's in the question, along with the other shapes and dtypes. – nluigi Feb 03 '17 at 15:57
  • Well then, tell me what `n` and `ns` are: n=3, 100, or 1000? – hpaulj Feb 03 '17 at 15:59
  • @hpaulj - sorry, I didn't realize you wanted actual values: `n=1000` and `ns=9`. – nluigi Feb 03 '17 at 16:03
  • So the loop runs 9 times, working with (1000,1000) arrays. The number of loops is small compared to the computations. In that case the vectorization that I suggested doesn't help much, if any. I explored this at http://stackoverflow.com/questions/41942115/numpy-efficient-large-dot-products. I'm not optimistic about `numba` helping either, since most of the work is being done in compiled `numpy` code already. – hpaulj Feb 03 '17 at 17:13
  • @hpaulj OK, makes sense. I have had some success in the meantime with Numba: rewriting the methods to loop over all indices and turning on autojit in nopython mode has reduced the run-time by a factor of 3. Maybe some additional gains can be had by properly setting the variable types. – nluigi Feb 03 '17 at 17:18

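Based on the Numba approach described in the comments above (rewriting the methods as explicit loops over all indices and compiling them in nopython mode), a rough sketch of such a rewrite might look as follows. This is only an illustration, not the exact code used; it assumes `ex`, `ey`, and `ws` are flat length-`ns` arrays and that `f`/`ftmp` are padded by two cells around the `n x n` interior, as in the code above:

import numba

@numba.jit(nopython=True)
def calculate_macrovars_nb(f, n, ns, ex, ey, dens, velx, vely):
    for i in range(n):
        for j in range(n):
            # accumulate density and momentum for node (i, j)
            d = 0.0
            vx = 0.0
            vy = 0.0
            for s in range(ns):
                fs = f[s, i + 2, j + 2]
                d += fs
                vx += ex[s]*fs
                vy += ey[s]*fs
            dens[i, j] = d
            velx[i, j] = vx/d
            vely[i, j] = vy/d

@numba.jit(nopython=True)
def do_relaxation_nb(f, ftmp, n, ns, ws, ex, ey, cssq, omega, dens, velx, vely):
    for i in range(n):
        for j in range(n):
            vx = velx[i, j]
            vy = vely[i, j]
            vv = (vx*vx + vy*vy)/cssq
            for s in range(ns):
                ev = (ex[s]*vx + ey[s]*vy)/cssq
                # equilibrium distribution for direction s at node (i, j)
                feq = ws[s]*dens[i, j]*(1. + ev + 0.5*ev*ev - 0.5*vv)
                ftmp[s, i + 2, j + 2] = (1. - omega)*f[s, i + 2, j + 2] + omega*feq

As suggested in the comments, specifying the argument types explicitly in the jit decorator may squeeze out a little more.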