Speed up list creation for large arrays in Python

Question

My system is composed of 12000 atoms (3D points), my purpose is to compute quantities on theses atoms.

For that I have informations on each atoms in a dict object (like type, charge and various parameters) a list of neighbors for each atoms (giving all atoms indexes at a distance < threshold), and a distance matrix between all atoms.

Quantities to compute are potentials between a given atom and all of it's neighbors (from the list of neighbors). It require the distance between atoms, the informations and parameters and the list of neighbors for a given atom.

I proceed with this code:

def LJ(sii,sjj,Eii,Ejj,rij):
    """
    Fonction that compute the LJ potential. one of the quantities

    sii : sigma for atom i ( a parameter )
    sjj : sigma for atom j ( a parameter )
    Eii : epsilon for atom i ( a parameter )
    Ejj : epsilon for atom j ( a parameter )
    rij : squared distance between atom i and j 
    """
    sij = (sii+sjj)*0.5
    Eij = (Eii*Ejj)**0.5
    return ((4*Eij*(sij**12))/(rij**6)) - ((4*Eij*(sij**6))/(rij**3))

def Coul(qi,qj,rij):
    """
    Fonction that compute the Coul potential. another quantities
    qi : charge of atom i ( a parameter )
    qj : charge of atom j ( a parameter )
    rij : squared distance between atom i and j 
    """
    V = 138.935458*((qi*qj)/(rij**0.5)) # because rij is squared i take the square root
    return V

def energies(NL,sP,D):
    """
    NL : a vector of neighbors with as index the position and as value a list of indexes of neighbors
        for example [[1,2,3],[0,2]...] the first atom at position 0 as 3 neighbors 1,2 and 3. The second atom at position 1 as 2 neighbors 0 and 2 etc...
    sP : a distionnary giving the parameters for a given atoms, and also neighbors distant of 4 atoms in the mist 'dihedrals'. The key correspond to the index of the distance matric and the neighboring list, but it's a string
        sample : {'key': '0_NH3', 'num': 0, 'type': 'NH3', 'resnr': 0, 'residu': 'ASP', 'atom': 'N', 'charge': -0.3, 'sigma': 0.329632525712, 'epsilon': 0.8368, 'voisins': [1, 2, 3, 4, 5, 6, 12], 'dihedrals': [7, 8, 9, 13, 14]}
    D : the squared distance matrix between all atoms
    """
    
    coul,ljk,ljk14,coul14 = [],[],[],[] # I want the potential for each atom and I want 4 types of potentials
    # coul and ljk are potentials calculated on neighboring atoms in the NL neighbors list
    # coul14 and ljk14 are potentials calculated on neighboring atoms in the sP[key]['dihedrals'] list

    for k in sP.keys(): # For each atom 
        # k is a string that represent an int sP['0'] as the informations for D[0][:] or D[:][0] and NL[0] 
        ljk.append(sum([LJ(sP[k]['sigma'],sP[str(n)]['sigma'],sP[k]['epsilon'],sP[str(n)]['epsilon'],D[int(k)][n]) for n in NL[int(k)]]))
        coul.append(sum([Coul(sP[k]['charge'],sP[str(n)]['charge'],D[int(k)][n]) for n in NL[int(k)]]))
        ljk14.append(sum([LJ(sP[k]['sigma'],sP[str(n)]['sigma'],sP[k]['epsilon'],sP[str(n)]['epsilon'],D[int(k)][n]) for n in sP[k]['dihedrals']]))
        coul14.append(sum([Coul(sP[k]['charge'],sP[str(n)]['charge'],D[int(k)][n]) for n in sP[k]['dihedrals']]))
        # because I want the sum of all potentials for each atom I use a sum on a comprehensive list of potential between a given atom and it's neighbors

On my machine the code takes about 40 seconds to compute for 12134 atoms. I really have to reduce this number as much as possible. This is the best version I got, but I feel something can be done with numpy and vectorization, but I didn't find anything.

Thanks for the help

You'll almost certainly get better performance with numpy. It's written with optimized C code and takes advantage of parallel hardware. — Barmar, Jun 22 '22 at 16:56
What do you mean you "didn't find anything"? Numpy is the answer here but your question however can't just be "numpy?" Your question needs more focus. I suggest reading a tutorial on the basics of numpy and coming back with an actual question. — ddejohn, Jun 22 '22 at 16:56
It'll help if you provide sample data, the output you get with your code and time / performance benchmarks for others to take a shot at improving it — Mortz, Jun 22 '22 at 17:00
In any case regarding performance, start by profiling your code - find out what's actually taking time. How many neighbors do you have for each atom? Are there any of these calculations that are the same multiple times? (i.e. can it be preprocessed or cached with an lru_cache or something similar?) - numpy will definitely speed things up, but fundamental changes to the algorithms could give you a large efficiency boost ("the fastest code is code that never runs") — MatsLindh, Jun 22 '22 at 17:00
You can refer https://stackoverflow.com/questions/14452145/how-to-measure-time-taken-between-lines-of-code-in-python to understand the bottlenecks in your code as suggested by @MatsLindh — Rohit Babu, Jun 22 '22 at 17:09
The part that is taking the most time is the LJ function, and then the Coul function. If I replace the calculation by a more simple one like 1+1, the code runs in an instant. My question is about a restructuration of the code. I am working into numpy and then I will edit or repost If I find anything. — Jacques, Jun 23 '22 at 08:46

score 0 · Answer 1 · answered Jun 22 '22 at 17:40

One thing that should help right away is switching from using X**0.5 to math.sqrt(X). This link, Which is faster in Python: x**.5 or math.sqrt(x)?, shows the differences in functionality and speed between the two.

Additionally, you might look into the multiprocessing module, specifically multiprocessing.Pool(), which could help run your code on multiple CPU cores and decrease total run time. https://docs.python.org/3/library/multiprocessing.html

The last thing that might improve run time would be to save the values from the arrays/dicts you're reading from into local varibles. I'm not sure exactly how memory access or access speed works in python, but this could potentially reduce the number of CPU cycles required to read from memory. It is important to note that this will only be helpful if you are using the value more than once. Also, reading any variables that are not dependant on k should be moved outside of the for loop. For example:

sp_n_sigma = sP[str(n)]['sigma']
for k in sP.keys(): # For each atom 
        sP_k_sigma = sP[k]['sigma']
        ljk.append(sum([LJ(sP_k_sigma,sp_n_sigma, ...
        ljk14.append(sum([LJ(sP_k_sigma, sp_n_sigma, ...

Speed up list creation for large arrays in Python

1 Answers1