
This question is a follow-up to a recent question about MATLAB being twice as fast as Numpy.

I currently have a Gauss-Seidel solver implemented in both MATLAB and Numpy that acts on a 2D axisymmetric domain (cylindrical coordinates). The code was originally written in MATLAB and then transferred to Python. The MATLAB code runs in ~20 s whereas the Numpy code takes ~30 s. I would like to use Numpy, but since this code is part of a larger program, the ~50% longer simulation time is a significant drawback.

The algorithm simply solves the discretized Laplace equation on a rectangular mesh (in cylindrical coordinates). It finishes when the maximum difference between updates on the mesh is less than the indicated tolerance.
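
For reference, the interior update implemented below is just the finite-difference stencil for the axisymmetric Laplace equation solved for the centre node:

    v[i,j] = cr*( (1 - 1/(2*ri[i,j]))*v[i-1,j] + (1 + 1/(2*ri[i,j]))*v[i+1,j] ) + cz*( v[i,j-1] + v[i,j+1] )

with cr = dz^2/(2*(dr^2 + dz^2)), cz = dr^2/(2*(dr^2 + dz^2)) and ri the radial node index. The r = 0 and r = width rows get their own mirrored one-sided updates (the "left"/"right" boundary lines in the code), and the fixed potentials are reapplied after every sweep.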

The code in Numpy is:

import numpy as np
import time

T = np.transpose

# geometry
length = 0.008
width = 0.002

# mesh
nz = 256
nr = 64

# step sizes
dz = length/nz
dr = width/nr

# node position matrices
r = np.tile(np.linspace(0,width,nr+1), (nz+1, 1)).T
ri = r/dr

# equation coefficients
cr = dz**2 / (2*(dr**2 + dz**2))
cz = dr**2 / (2*(dr**2 + dz**2))

# initial/boundary conditions
v = np.zeros((nr+1,nz+1))
v[:,0] = 1100
v[:,-1] = 0
v[31:,29:40] = 1000
v[19:,54:65] = -200

# convergence parameters
tol = 1e-4

# Gauss-Seidel solver
tic = time.time()
max_v_diff = 1
while (max_v_diff > tol):

    v_old = v.copy()

    # left boundary updates
    v[0,1:nz] = cr*2*v[1,1:nz] + cz*(v[0,0:nz-1] + v[0,2:nz+1])
    # internal updates
    v[1:nr,1:nz] = cr*((1 - 1/(2*ri[1:nr,1:nz]))*v[0:nr-1,1:nz] + (1 + 1/(2*ri[1:nr,1:nz]))*v[2:nr+1,1:nz]) + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1])
    # right boundary updates
    v[nr,1:nz] = cr*2*v[nr-1,1:nz] + cz*(v[nr,0:nz-1] + v[nr,2:nz+1])
    # reapply grid potentials
    v[31:,29:40] = 1000
    v[19:,54:65] = -200

    # check for convergence
    v_diff = v - v_old
    max_v_diff = np.absolute(v_diff).max()

toc = time.time() - tic
print(toc)

This is actually not the full algorithm that I use. The full algorithm uses successive overrelaxation and a checkerboard iteration scheme to improve speed and remove solver directionality, but for simplicity I provided this easier-to-understand version. The speed drawback in Numpy is more pronounced for the full version (17 s in Numpy vs. 9 s in MATLAB).
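
Roughly, the full scheme looks like the sketch below (successive overrelaxation applied with a red-black/checkerboard ordering). This is a simplified illustration rather than my exact code; the relaxation factor omega is just a placeholder value, and it reuses the setup arrays from the listing above.

# rough sketch of SOR with red-black (checkerboard) ordering -- illustrative only
omega = 1.8  # placeholder relaxation factor

# checkerboard masks over the interior nodes
ii, jj = np.meshgrid(np.arange(1, nr), np.arange(1, nz), indexing='ij')
red = (ii + jj) % 2 == 0
black = ~red

def sweep(color):
    # Gauss-Seidel value for every interior node; relax by omega on one color only
    gs = (cr*((1 - 1/(2*ri[1:nr,1:nz]))*v[0:nr-1,1:nz]
              + (1 + 1/(2*ri[1:nr,1:nz]))*v[2:nr+1,1:nz])
          + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1]))
    interior = v[1:nr,1:nz]
    interior[color] += omega*(gs[color] - interior[color])

max_v_diff = 1
while max_v_diff > tol:
    v_old = v.copy()
    # boundary rows, as in the simplified version
    v[0,1:nz] = cr*2*v[1,1:nz] + cz*(v[0,0:nz-1] + v[0,2:nz+1])
    v[nr,1:nz] = cr*2*v[nr-1,1:nz] + cz*(v[nr,0:nz-1] + v[nr,2:nz+1])
    # interior: update red nodes first, then black nodes see the new red values
    sweep(red)
    sweep(black)
    # reapply grid potentials and check convergence as before
    v[31:,29:40] = 1000
    v[19:,54:65] = -200
    max_v_diff = np.absolute(v - v_old).max()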

I tried the solution from the previous question, changing v to a column-major order array, but there was no performance increase.
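
(For what it's worth, the column-major change in NumPy amounts to something along the lines of the following.)

v = np.asfortranarray(v)               # column-major (Fortran-order) copy of v
# or, equivalently, allocate it that way from the start:
# v = np.zeros((nr+1, nz+1), order='F')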

Any suggestions?

Edit: The Matlab code for reference is:

% geometry
length = 0.008;
width = 0.002;

% mesh
nz = 256;
nr = 64;

% step sizes
dz = length/nz;
dr = width/nr;

% node position matrices
r = repmat(linspace(0,width,nr+1)', 1, nz+1);
ri = r./dr;

% equation coefficients
cr = dz^2/(2*(dr^2+dz^2));
cz = dr^2/(2*(dr^2+dz^2));

% initial/boundary conditions
v = zeros(nr+1,nz+1);
v(1:nr+1,1) = 1100;
v(1:nr+1,nz+1) = 0;
v(32:nr+1,30:40) = 1000;
v(20:nr+1,55:65) = -200;

% convergence parameters
tol = 1e-4;
max_v_diff = 1;

% Gauss-Seidel Solver
tic
while (max_v_diff > tol)
    v_old = v;

    % left boundary updates
    v(1,2:nz) = cr.*2.*v(2,2:nz) + cz.*( v(1,1:nz-1) + v(1,3:nz+1) );
    % internal updates
    v(2:nr,2:nz) = cr.*( (1 - 1./(2.*ri(2:nr,2:nz))).*v(1:nr-1,2:nz) + (1 + 1./(2.*ri(2:nr,2:nz))).*v(3:nr+1,2:nz) ) + cz.*( v(2:nr,1:nz-1) + v(2:nr,3:nz+1) );    
    % right boundary updates
    v(nr+1,2:nz) = cr.*2.*v(nr,2:nz) + cz.*( v(nr+1,1:nz-1) + v(nr+1,3:nz+1) );
    % reapply grid potentials
    v(32:nr+1,30:40) = 1000;
    v(20:nr+1,55:65) = -200;

    % check for convergence
    max_v_diff = max(max(abs(v - v_old)));

end
toc
nicholls
  • You can start by profiling the code and identifying the bottleneck(s). – tiago Jul 10 '13 at 21:41
  • If you're looking to improve performance, have you considered Cython (Python-like code compiled into C) or Numba (JIT compilation using LLVM)? Here is an interesting comparison: http://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/ – Amro Jul 10 '13 at 23:13
  • I should mention that this implementation is actually Jacobi instead of Gauss-Seidel, and that there are applied potentials (but these could be removed very easily). – nicholls Jul 11 '13 at 15:21

2 Answers


On my laptop your code runs in about 45 seconds. By reducing the creation of intermediate arrays to the bare minimum, including reuse of pre-allocated work arrays, I have managed to bring that time down to 27 seconds. That should put you back at the level of MATLAB, but your code becomes less readable. Anyway, below is code to replace everything from your # Gauss-Seidel solver comment onwards:

# work arrays
v_old = np.empty_like(v)
w1 = np.empty_like(v[0, 1:nz])
w2 = np.empty_like(v[1:nr,1:nz])
w3 = np.empty_like(v[nr, 1:nz])

# constants
A = cr * (1 - 1/(2*ri[1:nr,1:nz]))
B = cr * (1 + 1/(2*ri[1:nr,1:nz]))

# Gauss-Seidel solver
tic = time.time()
max_v_diff = 1
while (max_v_diff > tol):

    v_old[:] = v

    # left boundary updates
    np.add(v_old[0, 0:nz-1], v_old[0, 2:nz+1], out=v[0, 1:nz])
    v[0, 1:nz] *= cz
    np.multiply(2*cr, v_old[1, 1:nz], out=w1)
    v[0, 1:nz] += w1
    # internal updates
    np.add(v_old[1:nr, 0:nz-1], v_old[1:nr, 2:nz+1], out=v[1:nr, 1:nz])
    v[1:nr,1:nz] *= cz
    np.multiply(A, v_old[0:nr-1, 1:nz], out=w2)
    v[1:nr,1:nz] += w2
    np.multiply(B, v_old[2:nr+1, 1:nz], out=w2)
    v[1:nr,1:nz] += w2
    # right boundary updates
    np.add(v_old[nr, 0:nz-1], v_old[nr, 2:nz+1], out=v[nr, 1:nz])
    v[nr, 1:nz] *= cz
    np.multiply(2*cr, v_old[nr-1, 1:nz], out=w3)
    v[nr,1:nz] += w3
    # reapply grid potentials
    v[31:,29:40] = 1000
    v[19:,54:65] = -200

    # check for convergence
    v_old -= v
    max_v_diff = np.absolute(v_old).max()

toc = time.time() - tic
Jaime
  • This is a really nice way to show how expensive temporary arrays can be. It may lead to code that is less readable, but I'll try to keep in mind the possibilities of using the `out=...` argument. – jorgeca Jul 10 '13 at 23:40
  • Yeah, http://stackoverflow.com/questions/17527340/more-efficient-way-to-calculate-distance-in-numpy kinda shows the same problem (and also got an interesting reply by Jaime ;-) ): as soon as you start using a lot of RAM on temporary arrays, your calculation times rise very fast – usethedeathstar Jul 11 '13 at 06:31
  • Actually, by using only the trick with A and B the computing time on my machine goes from 55s to 28s. (Note that doing the same on Matlab would also reduce the computing time.) All the other stuff with temporary arrays is slower on my machine. – J. Martinot-Lagarde Jul 11 '13 at 14:17
  • OK, so the A and B method offers a drastic improvement: 30s -> 16.2s in Python and 20s -> 13s in Matlab. This is making Python catch up! However, Jaime, the code you provided using the temporary arrays is actually slower, as the previous commenter suggested. Using this method makes Python simulate in 21.8s. – nicholls Jul 11 '13 at 15:32
  • That's a little puzzling... Not recalculating `A` and `B` over and over again has a clear advantage, but there is something not quite right in numpy. Try timing the following to see what I mean: `a = np.arange(1000); b = np.arange(1000); c = np.empty_like(a); %timeit a + b (100000 loops, best of 3: 1.97 us per loop); %timeit np.add(a, b) (1000000 loops, best of 3: 2.94 us per loop); %timeit np.add(a, b, out=c) (10000 loops, best of 3: 2.22 us per loop)`. `a + b` is supposed to be shorthand for `np.add(a, b)`, it's hard to understand where the overhead comes from... – Jaime Jul 11 '13 at 16:02
  • I got timings of 1.65 us, 1.63 us and 2.24 us respectively on my computer. – nicholls Jul 11 '13 at 16:12
  • Re timing `a+b` vs `np.add`: with just 1000 elements, the two extra namespace lookups for `np` in globals and `add` in `np` count. Set up a local `add = np.add` and, on average, the timing here is even a little faster the other way around, because the virtual operator call behind `a+b` is a bit more expensive than a local function call. – kxr Aug 29 '16 at 17:38

I've been able to reduce the running time on my laptop from 66 to 21 seconds by following this process:

  1. Find the bottleneck. I profiled the code using line_profiler from the IPython console to find the lines that took the most time (see the short sketch after this list). It turned out that over 80% of the time was spent in the line that does "internal updates".

  2. Choose a way to optimise it. There are several tools for speeding up numpy code (Cython, numexpr, weave...). In particular, scipy.weave.blitz is well suited to compiling numpy expressions, like the offending line, into fast code. In theory, that line could be wrapped inside "..." and executed as weave.blitz("..."), but the array being updated is also used in the computation, so, as stated by point #4 in the docs, a temporary array must be used to keep the same result:

    expr = "temp = cr*((1 - 1/(2*ri[1:nr,1:nz]))*v[0:nr-1,1:nz] + (1 + 1/(2*ri[1:nr,1:nz]))*v[2:nr+1,1:nz]) + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1]); v[1:nr,1:nz] = temp"
    temp = np.empty((nr-1, nz-1))
    ...
    while ...
        # internal updates
        weave.blitz(expr)
    
  3. After checking that the results are correct, disable runtime checks by using weave.blitz(expr, check_size=0). The code now runs in 34 seconds.

  4. Building on Jaime's work, precompute the constant factors A and B in the expression. The code now runs in 21 seconds (with minimal changes, but it now needs a compiler).
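
For completeness, the profiling in step 1 can be reproduced roughly as follows (a sketch: it assumes the solver loop has been wrapped in a function, here hypothetically called solve, so that line_profiler can instrument it):

# in the IPython console, with the solver loop wrapped in a function `solve`:
%load_ext line_profiler
%lprun -f solve solve(v)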

This is the core of the code:

from scipy import weave

# [...] Set up code till "# Gauss-Seidel solver"

tic = time.time()
max_v_diff = 1
A = cr * (1 - 1/(2*ri[1:nr,1:nz]))
B = cr * (1 + 1/(2*ri[1:nr,1:nz]))
expr = "temp = A*v[0:nr-1,1:nz] + B*v[2:nr+1,1:nz] + cz*(v[1:nr,0:nz-1] + v[1:nr,2:nz+1]); v[1:nr,1:nz] = temp"
temp = np.empty((nr-1, nz-1))
while (max_v_diff > tol):
    v_old = v.copy()
    # left boundary updates
    v[0,1:nz] = cr*2*v[1,1:nz] + cz*(v[0,0:nz-1] + v[0,2:nz+1])
    # internal updates
    weave.blitz(expr, check_size=0)
    # right boundary updates
    v[nr,1:nz] = cr*2*v[nr-1,1:nz] + cz*(v[nr,0:nz-1] + v[nr,2:nz+1])
    # reapply grid potentials
    v[31:,29:40] = 1000
    v[19:,54:65] = -200
    # check for convergence
    v_diff = v - v_old
    max_v_diff = np.absolute(v_diff).max()
toc = time.time() - tic
jorgeca
  • When doing it the other way around, using the constants I get 28s instead of 55s, and by using weave I go down to 24s. The main improvement is actually by using the constants! – J. Martinot-Lagarde Jul 11 '13 at 14:25
  • @J.Martinot-Lagarde That's interesting: I hadn't tested that combination and it's indeed fast. Though, on my laptop the final weave version is 30% faster instead of just 15% faster as in your timings. – jorgeca Jul 11 '13 at 14:46
  • By the way, can you check whether Matlab is actually faster when using A and B? Maybe it can recognise that they don't change? – jorgeca Jul 11 '13 at 14:56
  • I implemented the A and B change in Matlab and the timings went from 20s to 13s, a definite improvement. – nicholls Jul 11 '13 at 16:06
  • @nicholls Thanks. What are your timings for my solution? Based on the speed up I observed and your original timing, it should run in ~30 * (21/66) = ~10 s, on par with Matlab. Can you please add the matlab code to your question for reference? – jorgeca Jul 11 '13 at 16:53
  • @jorgeca Timings for your solution give 14s flat, which is the fastest solution yet. I'll edit my question and add the MATLAB code. – nicholls Jul 11 '13 at 17:15
  • Thank you for the code. Now ~40% of the time is spent running `weave.blitz` and ~30% checking for convergence, so an easy way to make it faster (when you expect many iterations) is to check convergence only every other iteration. – jorgeca Jul 11 '13 at 18:05
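
A minimal sketch of that cheaper convergence check (testing only every other iteration; `it` is just a hypothetical iteration counter, and the update steps are the ones from the answer above):

it = 0
max_v_diff = 1
while max_v_diff > tol:
    check = (it % 2 == 0)   # only pay for the copy and the max() every 2nd iteration
    if check:
        v_old = v.copy()
    # ... boundary, internal and grid-potential updates as in the answer above ...
    if check:
        max_v_diff = np.absolute(v - v_old).max()
    it += 1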