1

I have a matrix of distance between 12000 atoms (pairwise euclidean distance). And I want to convert that to a list of nodes adjacency, with the ith element of the list being a list of nodes that are within a threshold distance. For example if I have three points :

(0,0) (1,0) (1,1)

I will have the matrix :

[[0.         1.         1.41421356]
 [1.         0.         1.        ]
 [1.41421356 1.         0.        ]]

Then all pairs that satisfy the condition; distance <= 1; will be:

[[0 0]
 [0 1]
 [1 0]
 [1 1]
 [1 2]
 [2 1]
 [2 2]]

Then finally the adjacency list will be:

[[0,1],[0,1,2],[1,2]]

Here is a code that work:

from scipy.spatial import distance
import numpy as np

def voisinage(xyz):
    #xyz is a vector of positions in 3d space

    # matrice de distance
    dist = distance.cdist(xyz,xyz,'euclidean')

    # extract i,j pairs where distance < threshold
    paires = np.argwhere(dist<threshold)

    # prepare the adjacency list
    Vvoisinage = [[] for i in range(len(xyz))]

    # fill the adjacency list
    for p in paires:
        Vvoisinage[p[0]].append(p[1])

This code runs in approximately 4 to 5 seconds for around 12100 points in 3D space. I want to make it as fast as possible because it needs to run for thousands of sets of 12100 points, and there are other calculations for each set. I have tried to use networkX but it's a lot slower than this method.

The section to optimize is the last one because it takes in average 2.7 seconds so half the computation time.

Also maybe there is a faster way of doing all of this.

Thanks

ravenspoint
  • 19,093
  • 6
  • 57
  • 103
Jacques
  • 164
  • 8
  • I cannot reproduce the result you got with the function you wrote. The result i get is `[[0], [1], [2]]`. Can you explain how do you move from the pairs that satisfy the condition to the adjacency table? – inarighas Jun 21 '22 at 14:01
  • Did you try to change to dist<=1. Also I didn't include the origin node, I will corect that in the post – Jacques Jun 21 '22 at 14:04
  • yes and i get this `[[0, 1], [0, 1, 2], [1, 2]]` which stills different – inarighas Jun 21 '22 at 14:05
  • can you just explain better what does the adjacency list mean ? is it just the indices of points that are neighbors of some given point ? – inarighas Jun 21 '22 at 14:06
  • Adjacency list is for a node i, all nodes that are at a distance < treshold – Jacques Jun 21 '22 at 14:07
  • Okay. I will write you an answer – inarighas Jun 21 '22 at 14:08
  • 1
    Calculating the Euclidian distance involves extracting a square root for every pair, which is expensive. A cheap optimization is to calculate the square of the distance, then compare with the square of your distance limit. – ravenspoint Jun 21 '22 at 15:11
  • Would you do that with a list comprehension or do you know a faster way? – Jacques Jun 21 '22 at 15:27
  • @Jacques Who is the question in your last post addressed to? If it was me, then I do not understand the question. – ravenspoint Jun 21 '22 at 15:33
  • To calculate the square of the distance, what would you use? I tried first by doing a list comprehension like this: dist = [ [ [(xyz[i][0]-xyz[j][0])**2+(xyz[i][1]-xyz[j][1])**2+(xyz[i][2]-xyz[j][2])**2] for j in range(natm)] for i in range(natm)] But it is really slow compared by the cdistance function that i use – Jacques Jun 21 '22 at 15:37
  • I also used the mode 'sqeuclidean' witch computes the square distance, and it's almost 70% faster so I might use that – Jacques Jun 21 '22 at 15:43
  • @Jacques 'sqeuclidean' is the way to go. Remember to compare with the square of your limit distance. – ravenspoint Jun 21 '22 at 15:48

2 Answers2

2

First the diagonal of you distance matrix is not that useful since it's always equal to zero. To make your procedure faster, I used only numpy functions since they are usually faster than vanilla python list operations when dealing with arrays and matrices.

so first ignored the dist matrix diagonal by setting to np.nan Then, I grouped paires by first index (see Is there any numpy group by function?).

Here is my code:

from scipy.spatial import distance
import numpy as np

xyz = np.array([[0,0],[1,0],[1,1]])

threshold = 1

# distance matrix
dist = distance.cdist(xyz,xyz,'euclidean')

# ignore diagonal values
np.fill_diagonal(dist, np.nan)

# extract i,j pairs where distance < threshold
paires = np.argwhere(dist<=threshold)

# groupby index
tmp = np.unique(paires[:, 0], return_index=True)
neighbors = np.split(paires[:,1], tmp[1])[1:]
indices = tmp[0]

The output corresponds to a list of lists, such as each list corresponds to nodes that are neighbors to the node corresponding to the index.

In terms of quatitative performance (on my computer ofc), your function takes ~4.5s on a 12000 random generated points whereas mine takes ~1.3s.

inarighas
  • 720
  • 5
  • 24
2
  1. As discussed in comments, a cheap optimization is to compare the squares of distances. This gives the same result and avoids extracting square roots.

  2. As discussed by @inarighas it is redundant to calculate the full matrix. You can avoid calculating the leading diagonal and the entire upper triangle. This will double your performance.

  3. If you are interested in performance, you should not use an interpreted language like python. A compiled language such as C++ can give you as much as a 50 times performance boost. Here is the C++ code for your problem.

    class cLoc
    {
    public:
        float x, y, z;
        cLoc(float X, float Y, float Z)
            : x(X), y(Y), z(Z)
        {
        }
        float dist2(const cLoc &other) const
        {
            float dx = x - other.x;
            float dy = y - other.y;
            float dz = z - other.z;
            return dx *dx + dy *dy + dz *dz;
        }
    };
    
    class cAtoms
    {
    public:
        std::vector<cLoc> myAtomLocs;
        std::vector<std::vector<int>> myClose;
        float myMaxDist2;
    
        void generateTest1();
        void neighbors();
        std::string text();
    };
    
    void cAtoms::generateTest1()
    {
        myAtomLocs = {
            cLoc(0, 0, 0),
            cLoc(1, 0, 0),
            cLoc(1, 1, 0)};
        myMaxDist2 = 1;
    }
    
    void cAtoms::neighbors()
    {
        for (int ks = 0; ks < myAtomLocs.size(); ks++)
        {
            std::vector<int> v;
            v.push_back(ks);
            for (int kd = ks + 1; kd < myAtomLocs.size(); kd++)
            {
                if (myAtomLocs[ks].dist2(myAtomLocs[kd]) <= myMaxDist2)
                    v.push_back(kd);
            }
            myClose.push_back(v);
        }
    }
    
    std::string cAtoms::text()
    {
        std::stringstream ss;
        for( auto& v : myClose ) {
            if( !v.size() )
                continue;
            ss << v[0] <<": ";
            for( int k = 1; k < v.size(); k++ )
            {
                ss << v[k] << " ";
            }
            ss << "\n";
        }
        return ss.str();
    }
    main()
    {
            cAtoms myAtoms;
            myAtoms.generateTest1();
            myAtoms.neighbors();
            std::cout << myAtoms.text();
    }

output is

0: 1 
1: 2

If I generate 12,100 atoms as follows

 void cAtoms::generateRandom()
 {
    srand(time(NULL));

    for( int k = 0; k < 12100; k++ ) {
        myAtomLocs.push_back( cLoc(
            (rand() % 100 ) / 10.0f,
            (rand() % 100 ) / 10.0f,
            (rand() % 100 ) / 10.0f
           ));
    }
    myMaxDist2 = 1;
 }

then the neighbors() method runs in 125 milliseconds on my laptop.

Code for the complete application at https://github.com/JamesBremner/atoms

ravenspoint
  • 19,093
  • 6
  • 57
  • 103