optimizing vectorized operations made by sections in NumPy

Question

Long story short, I need to make vector operations over a 2D matrix with the values of the matrix itself for thousands of iterations, but for reasons I explain below, I need to do it in multiple sections, and I want to know the best way to do it by still getting the best possible performance and readibility.

I'm solving Laplace's equation in order to generate grids for computational aerodynamics simulations.

For this, let's say I have a 2D matrix called X of shape (M, N) where M and N are the number of rows and columns respectively, and I need to get the value of every internal node with "coordinates" (i, j), which is affected by its neighbors, points (i+1, j) (i-1, j) (i, j+1) (i, j-1). Take for example the next equation:

X[i, j] = (X[i+1, j] - X[i-1, j] + X[i, j+1] - X[i, j-1]) / 4

The code runs for several iterations, in the order of hundred of thousands, and in each iteration I need to traverse the whole matrix computing every internal node. The equation above states that the computations are made in the matrix itself, and that the values of X[i-1, j] and X[i, j-1] are the ones already computed in the current iteration.

So, that's the background of the problem in question, now to the code I'm writing. As a newbie I started with the obvious, not optimal, approach of two nested loops, one for rows and one for columns, which are already inside a while loop (number of iterations):

while current_it < it_max:
    for i in range(1, M-1):
        for j in range(1, N-1):
            X[i, j] = (X[i+1, j] - X[i-1, j] + X[i, j+1] - X[i, j-1]) / 4

This worked, and for small sizes of the matrix it executed in relatively low time, 5 minutes approx, I know that's already huge in execution time, but it wasn't really a problem. But I need big grids, for example a grid of size 1200 x 400, and in this cases the execution time grows exponentially and it takes DAYS to solve the grid, which is no longer affordable.

Thanks to this question, I realized I can vectorize the equation and get rid of the nested for loops, so now my code looks something like

while current_it < it_max:
    # replacements of i and j
    #  j or  i      -->   1:-1
    # (j or  i) + 1 -->   2:
    # (j or  i) - 1 -->   :-2
    X[1:-1, 1:-1] = (X[2:, 1:-1] - X[:-2, 1:-1] + X[1:-1, 2:] - X[1:-1, :-2]) / 4

This represents a HUGE improvement in execution time, a grid that in the classic approach would take 3 days to generate it now takes maybe 5 minutes.

The problem I have now is that I lost the ability of getting the values of (i-1) and (j-1) for the current iteration, and this is making the code execute much more iterations that I suspect are needed.

My solution to this is to divide the matrix in sections, and compute each piece at a time.

while current_it < it_max:
    # 1st piece [i, 1 : lim_1]
    # 2nd piece [i, lim_1 :]
    X[1:-1, 1:lim_1] = (X[2:, 1:lim_1] - X[:-2, 1:lim_1] \
            + X[1:-1, 2:lim_1 + 1] - X[1:-1, :lim_1 - 1]) / 4
    X[1:-1, lim_1:-1] = (X[2:, lim_1:-1] - X[:-2, lim_1:-1] \
            + X[1:-1, lim_1 + 1:] - X[1:-1, lim_1 - 1:-2]) / 4

But I know copy-pasting is bad practice, and also the lines of code are growing quickly since I need multiple sections in both i and j directions.

What would be the best way of re-arranging this final code in order to get the the best performance and readibility.

score 2 · Accepted Answer · answered Apr 11 '20 at 18:54

This is a kind of problem where one can really benefit from using numba. For the setup below, I get almost twice the speed of the numpy solution without sacrificing readability.

import numpy as np
from numba import jit

X = np.random.randn(100, 100)
it_max = 1000

@jit
def loopy(X):
  N, M = X.shape
  for itr in range(it_max):
    for i in range(1, M-1):
      for j in range(1, N-1):
        X[i, j] = (X[i+1, j] - X[i-1, j] + X[i, j+1] - X[i, j-1]) / 4
  return X


def vectory(X):
  for itr in range(it_max):
    # replacements of i and j
    #  j or  i      -->   1:-1
    # (j or  i) + 1 -->   2:
    # (j or  i) - 1 -->   :-2
    X[1:-1, 1:-1] = (X[2:, 1:-1] - X[:-2, 1:-1] + X[1:-1, 2:] - X[1:-1, :-2]) / 4
  return X


Xc = X.copy()
%timeit loopy(Xc)   # 10 loops, best of 3: 25.1 ms per loop
Xc = X.copy()
%timeit vectory(Xc) # 10 loops, best of 3: 43.1 ms per loop

optimizing vectorized operations made by sections in NumPy

1 Answers1