After a series of distance computations to identify the neighbors of every atom, I end up with the following neighbor table (the first column is the atom itself, the second is one of its neighbors):

array([[ 0,  1],
       [ 1,  0],
       [ 1,  2],
       [ 2,  1],
       [ 2,  3],
       [ 3,  2],
       [ 3,  4],
       [ 4,  3],
       [ 4,  5],
          ...
       [48, 47],
       [48, 49],
       [49, 48]])

For instance, atom 0 has only one neighbor, atom 1 (that is the meaning of row 0). Atom 1 has two neighbors, atoms 0 and 2, so it appears in two rows. The pattern continues, and since no atom has an index greater than 49, the last atom also has only one neighbor, atom 48.

What I want is to reshape this array so that each row refers to a single atom and all of its neighbors:

array([[ 0,  1],
       [ 1,  0, 2],
       [ 2,  1, 3],
       [ 3,  2, 4],
       [ 4,  3, 5],
          ...
       [48, 47, 49],
       [49, 48]])

where the first column holds the atoms themselves and the remaining columns hold all of their neighbors.

Because the array will contain hundreds of thousands of items and the operation will be called thousands of times, I don't want to use a Python loop; I'm looking for a very efficient way of doing this. Moreover, the number of neighbors is not fixed at one for the first and last atoms and two for the rest: it can vary from atom to atom. Hence, indexing tricks that happen to work on this example probably won't work in general.

I thought about array manipulation methods but didn't manage to solve the problem. I'd appreciate any guidance. Thank you.

Martin Brisiak
demosian
  • Iterating through your array a single time and creating a new array of desired type is O(n), which is not bad. I don’t think you can do better than that. – pakpe Jan 17 '21 at 15:28
  • @pakpe you are right, only that using for-loops in python is less efficient/performant than using compiled C functions from numpy and pandas. – Marc Jan 17 '21 at 17:00
  • you can also check this thread: [is-there-any-numpy-group-by-function](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function/43094244) – Marc Jan 17 '21 at 17:02

1 Answer

This looks like a group-by operation. NumPy has little built-in support for group-by operations, but pandas does.

Here's an example of doing this efficiently using a pandas groupby:

import numpy as np
import pandas as pd

neighbors = np.array([[ 0,  1],
                      [ 1,  0],
                      [ 1,  2],
                      [ 2,  1],
                      [ 2,  3],
                      [ 3,  2],
                      [ 3,  4],
                      [ 4,  3],
                      [ 4,  5],
                      [48, 47],
                      [48, 49],
                      [49, 48]])

g = pd.Series(neighbors[:, 1]).groupby(neighbors[:, 0]).apply(list)
grouped = pd.DataFrame(g.to_list(), index=g.index).reset_index().to_numpy()

print(grouped)
# [[ 0.  1. nan]
#  [ 1.  0.  2.]
#  [ 2.  1.  3.]
#  [ 3.  2.  4.]
#  [ 4.  3.  5.]
#  [48. 47. 49.]
#  [49. 48. nan]]

Note that a NumPy array cannot have rows of different lengths; here pandas fills the missing entries with np.nan, which is also why the result is cast to float.
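
If the float cast and NaN padding are unwanted, a NumPy-only variant is also possible. The following is a minimal sketch rather than part of the pandas approach above: the function name `group_neighbors` and the `fill=-1` sentinel are assumptions chosen for illustration, and it assumes a non-empty integer array of shape (n, 2). It sorts by atom index, finds group boundaries with `np.unique`, and then builds both a ragged list of neighbor arrays and an integer array padded with the sentinel.

```python
import numpy as np

def group_neighbors(pairs, fill=-1):
    """Group neighbors by atom index (hypothetical helper, not a NumPy API).

    Assumes `pairs` is a non-empty integer array of shape (n, 2) and that
    `fill` is a value that never occurs as a real atom index.
    """
    pairs = np.asarray(pairs)
    # Stable sort keeps each atom's neighbors in their original order.
    order = np.argsort(pairs[:, 0], kind="stable")
    atoms_sorted = pairs[order, 0]
    neigh_sorted = pairs[order, 1]

    # On sorted input, the first-occurrence index marks where each group starts.
    atoms, starts, counts = np.unique(
        atoms_sorted, return_index=True, return_counts=True
    )

    # Ragged form: one array of neighbors per atom.
    ragged = np.split(neigh_sorted, starts[1:])

    # Padded form: integer array, short rows filled with the sentinel.
    padded = np.full((atoms.size, 1 + counts.max()), fill, dtype=pairs.dtype)
    padded[:, 0] = atoms
    row = np.repeat(np.arange(atoms.size), counts)
    col = np.arange(neigh_sorted.size) - np.repeat(starts, counts) + 1
    padded[row, col] = neigh_sorted
    return atoms, ragged, padded

neighbors = np.array([[0, 1], [1, 0], [1, 2], [2, 1], [2, 3], [3, 2],
                      [3, 4], [4, 3], [4, 5], [48, 47], [48, 49], [49, 48]])

atoms, ragged, padded = group_neighbors(neighbors)
print(padded)  # first row is [0, 1, -1], last row is [49, 48, -1]
```

The padded array is built entirely with array operations and keeps the integer dtype. The ragged list comes from `np.split`, which still iterates over the groups internally, so the padded form is probably the better fit when speed matters most.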

jakevdp