
What can I do to significantly speed up the execution of the following code, in particular for even higher values of `N`? I think the main bottleneck is the sum in the `get_field` function.

from numpy import *
from numpy.linalg import *
from scipy import constants as co
import pylab as plt

eps0 = co.epsilon_0


charges = []


class charge:
    def __init__(self, q, pos):
        self.q=q
        self.pos=pos

a = 10
N = 1001

d = 1e-3
Q = 1e-8
q = Q/N**2

for x in linspace(0,a,N):
    for y in linspace(0,a,N):
        ## First plate
        position = array([x,y,0])
        NewCharge = charge(q,position)
        charges.append(NewCharge)

        ## Second Plate
        position2 = array([x,y,d])
        NewCharge2 = charge(-q,position2)
        charges.append(NewCharge2)

def get_field(position):
    E = array([0.,0.,0.])
    
    for C in charges:
        q = C.q
        rVector = position - C.pos
        r = norm(rVector)
        r0Vector = rVector/r
        
        E += 1/(4*pi*eps0)*q/r**2 * r0Vector
        #print(E)
    return E[2]




p = 0.1
x0 = 0.5*a
y0 = 0.5*a


#print(get_field([x0,y0,1]))

Z = linspace(p*d,(1-p)*d,100)

E = [get_field([x0,y0,z]) for z in Z]

print(Z)
print(E)
plt.plot(Z,E,'x')
plt.ylim(bottom=0)
plt.show()

The background is a physics problem where I want to find the electric field of a parallel plate capacitor by explicitly summing up the coulomb fields of evenly distributed point charges on two parallel plates.

Update: @Blckknght's answer improves the performance of my code significantly. For example, for N = 100 my original code takes approx. 90 seconds on my system and Blckknght's code 0.6 seconds; for N = 500 my code takes approx. 2300 seconds and Blckknght's code approx. 3 seconds.

The problem with this improved code is that the memory consumption is so high that in my case the program simply gets killed once my 16 GB of memory are exceeded.

So is there any way to get those optimizations without running into the memory problem at all?

dp21
  • Does this answer your question? [How to parallel sum a loop using multiprocessing in Python](https://stackoverflow.com/questions/29785427/how-to-parallel-sum-a-loop-using-multiprocessing-in-python) – Roshin Raphel Mar 02 '23 at 20:21
  • This can be sped up immensely if you move things into numpy arrays, rather than using Python lists and for loops. – Nelewout Mar 02 '23 at 20:29
  • @RoshinRaphel: Yes, parallelization would speed it up by the number of cores; I have already played with numba for this. However, for larger and larger N this wouldn't help too much, so I am looking for further significant optimizations. – dp21 Mar 02 '23 at 20:36
  • `r = norm(rVector)` is slow; you can try using a dictionary to reduce the number of calls to this function if there are redundant values. – Roshin Raphel Mar 02 '23 at 20:42
  • @RoshinRaphel: Can you explain how to use a dict in this case in a bit more detail? – dp21 Mar 02 '23 at 20:48
  • Whenever you call this function, save the result in a dict as `dict[rVector] = r`; the next time, check whether the value already exists in the dict before calling the function. If there are many repeated values, this should reduce the running time. – Roshin Raphel Mar 02 '23 at 20:52
  • I'm pretty sure all the norms are on unique vectors, so caching with a dictionary is not going to help any. A better trick is to use a single array for the vectors, and call `norm` with an `axis` parameter, so that it can do fancy CPU-level optimizations on the many parallel calculations. – Blckknght Mar 02 '23 at 21:40
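
A minimal illustration of that single-array-plus-`axis` pattern (dummy data, not from the original discussion):

import numpy as np
vectors = np.random.rand(1_000_000, 3)       # all displacement vectors in one array
norms = np.linalg.norm(vectors, axis=1)      # one vectorized call instead of a Python loop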

2 Answers


Programming fast numerical code with numpy is quite different from writing normal Python code. You often need to discard some best practices from normal code to get good numerical performance.

In this context, I think the thing to discard is using object-oriented programming, with your `charge` class. Using a single numpy array to contain all the coordinates of all the charges is going to be much more efficient than having a separate array for each point, as the library can use fancy parallel CPU instructions to do all the computation in one go, rather than charge by charge in a slow pure-Python loop.

Here's my crude solution, which builds up a few different arrays: one with the coordinates of the charges on the two plates, one with the charge values (`q` and `-q`), and one with the points, varying along the z axis, where you want to sample the field. I use various numpy broadcasting techniques to combine them.

import numpy as np
from scipy.constants import pi, epsilon_0 as eps0

a = 10
N = 1001

d = 1e-3
Q = 1e-8
q = Q/N**2

# All charge positions in one (2*N**2, 3) array; z alternates between 0 and d
charge_coords = np.mgrid[0:a:N*1j, 0:a:N*1j, 0:d:2j].reshape(3,-1).T
# Matching charge values, shaped (1, 2*N**2, 1) for broadcasting
charge_values = np.array([q, -q] * N**2).reshape(1,2*N**2,1)

p = 0.1
x0 = 0.5*a
y0 = 0.5*a

# 100 sample points between the plates, shape (100, 3)
test_coords = np.linspace([x0, y0, p*d], [x0, y0, (1-p)*d], 100)

# Displacement vectors from every charge to every test point, shape (100, 2*N**2, 3)
rVectors = test_coords[:, None,:] - charge_coords[None,:,:]
rMagnitudes = np.linalg.norm(rVectors, axis=2, keepdims=True)

# Coulomb field summed over all charges; E has shape (100, 3)
E = np.sum(1/(4*pi*eps0) * charge_values / rMagnitudes**3 * rVectors, axis=1)

The funny indexing of `np.mgrid` with imaginary step values is based on this answer, which I found by googling. You can also use `np.meshgrid`; it just takes some extra steps, as sketched below. The rest is pretty standard numpy logic: you just need to reshape or slice to add dimensions a few times to get the arrays to broadcast together properly in the final step.
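
For comparison, a roughly equivalent construction with `np.meshgrid` might look like this (a sketch, reusing `a`, `N`, and `d` from the code above; it should produce the same `charge_coords` array):

xs = np.linspace(0, a, N)
zs = np.array([0.0, d])                                       # the two plate heights
X, Y, Z = np.meshgrid(xs, xs, zs, indexing='ij')              # each of shape (N, N, 2)
charge_coords = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # shape (2*N**2, 3)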

The biggest downside of using big numpy arrays like this is memory usage. Since we're taking the Cartesian product of the 100 test coordinates with the 2 million-ish charge coordinates, we wind up with several gigabytes of memory being used all at once (about 5 GB for the largest array on my system). That could get slow if you run out of real RAM and your computer starts needing to swap things out of virtual memory. It might be possible to save a little memory here or there by deleting references to arrays we don't need any more (or not keeping persistent references to temporary intermediate values at all). I avoided making a separate variable for the unit vectors by just combining the division by the magnitude with the existing magnitude squared that the charge was being divided by.

Blckknght
  • Thanks. This contains some good ideas; however, as you already wrote, the memory consumption is gigantic. For N = 1001 it just exceeded the 16 GB of memory on my laptop (and then zsh killed the process; I don't really see why). – dp21 Mar 03 '23 at 06:36
  • ... because it ran out of memory: `Out of memory: Killed process 67754 (python3) total-vm:13142088kB, anon-rss:11721868kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:23116kB oom_score_adj:0` – dp21 Mar 03 '23 at 06:53
  • Sorry it's not working for you. I also have 16GB of ram, and while Python uses most of it running this code, it doesn't quite use so much that the interpreter gets killed. You may want to try it with smaller `N` values to see if it's computing the expected results at lower resolution, before exploring ways to save memory. – Blckknght Mar 03 '23 at 07:02
  • See my update in the original question. – dp21 Mar 04 '23 at 08:43
  • So a few ideas that might let you reduce the memory footprint: Put the `np.mgrid` expression directly into the calculation of `rVectors`, rather than making a `charge_coords` variable. That way its memory can get reclaimed as soon as it's no longer needed. You might be able to do the same with `charge_values` and `test_coords` in the final computation, but those are both smaller arrays, so they will have a smaller impact. – Blckknght Mar 04 '23 at 17:27
  • Thanks, I will try it. Do you think it would also help to use Python generators in some clever way here? Or maybe cutting the arrays into memory-wise manageable pieces, calculating E for every piece, and summing it up at the end? – dp21 Mar 05 '23 at 19:54
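
A minimal sketch of that block-wise idea (not from either answer): it reuses `charge_coords`, `charge_values`, `test_coords`, `pi`, and `eps0` exactly as defined in Blckknght's code, and the block size of 100,000 charges is an arbitrary choice that would need tuning to the available RAM.

# charge_coords, charge_values and test_coords as defined in the answer above
E = np.zeros_like(test_coords)                    # accumulated field, shape (100, 3)
cv = charge_values.reshape(-1, 1)                 # charge values as (2*N**2, 1) for slicing
block = 100_000                                   # charges per block (arbitrary)
for start in range(0, len(charge_coords), block):
    stop = start + block
    rVectors = test_coords[:, None, :] - charge_coords[None, start:stop, :]
    rMagnitudes = np.linalg.norm(rVectors, axis=2, keepdims=True)
    E += np.sum(cv[None, start:stop] / rMagnitudes**3 * rVectors, axis=1)
E /= 4 * pi * eps0                                # only the z-component E[:, 2] gets plotted

This keeps the per-iteration arrays at roughly 100 · block · 3 · 8 bytes (about 240 MB per block here) instead of holding the full (100, 2·N², 3) array in memory at once.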

If you can sacrifice some precision, you can convert the arrays to float32. This reduces the memory footprint, and it also runs about 3x faster on my hardware. Judging by the notebook's kernel memory monitor, there is still a peak of ~4 GB.

Updated code from @Blckknght's answer:

import numpy as np
from scipy.constants import pi, epsilon_0 as eps0

a = 10
N = 1001

d = 1e-3
Q = 1e-8
q = Q/N**2

charge_coords = np.mgrid[0:a:N*1j, 0:a:N*1j, 0:d:2j].reshape(3,-1).T.astype(dtype=np.float32)
charge_values = np.array([q, -q] * N**2, dtype=np.float32).reshape(1,2*N**2,1)

p = 0.1
x0 = 0.5*a
y0 = 0.5*a

test_coords = np.linspace([x0, y0, p*d], [x0, y0, (1-p)*d], 100, dtype=np.float32)

rVectors = test_coords[:, None,:] - charge_coords[None,:,:]
rMagnitudes = np.linalg.norm(rVectors, axis=2, keepdims=True)

E = np.sum(1/(4*pi*eps0) * charge_values / rMagnitudes**3 * rVectors, axis=1)
print([a.nbytes for a in (charge_coords, charge_values, test_coords, rVectors, rMagnitudes, E)])
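
If you want to see how much precision the float32 version actually gives up, one option (a sketch; `E_f32` and `E_f64` are hypothetical names for the result arrays of a float32 run and a float64 run of the code above) is to compare the z-components:

# worst-case relative deviation of the plotted z-component
rel_err = np.max(np.abs(E_f32[:, 2] - E_f64[:, 2]) / np.abs(E_f64[:, 2]))
print(f"max relative error: {rel_err:.2e}")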
bubblefoil