2

I have made an algorithm that works out what atoms are connected in a molecule from a list of atoms (given their distance from each other). To think of this problem outside of a physics context, it is simply a closed network problem whereby the nodes are the atoms and the edges are the atomic bonds connecting atoms. I have a list of nodes, and a list of the edges that connect the nodes, and I need to find out the list of each unique molecule. I have done this in the code below, however it's kinda slow and quite ugly. Is there a way to optimise this algorithm?

Here's my code that works, with the relevant information for you to try it (I'll provide another atom list to try called pairs_1 and chosen_atom_1, just change pairs_1 to pairs and chosen_atom_1 to chosen_atom for it to work of course)

pairs = [[0, 1],
 [0, 2],
 [3, 4],
 [3, 5],
 [6, 7],
 [6, 8],
 [9, 10],
 [9, 11],
 [12, 13],
 [12, 14],
 [15, 16],
 [15, 17],
 [18, 19],
 [18, 20],
 [21, 22],
 [21, 23],
 [24, 25],
 [24, 26],
 [27, 28],
 [27, 29],
 [30, 31],
 [30, 32],
 [33, 34],
 [33, 35],
 [36, 37],
 [36, 38],
 [39, 40],
 [39, 41],
 [42, 43],
 [42, 44],
 [45, 46],
 [45, 47]]

chosen_atom = [np.random.rand() for i in range(48)]

pairs_1 = [[0, 6],
 [1, 7],
 [2, 8],
 [3, 9],
 [4, 10],
 [5, 6],
 [5, 10],
 [6, 7],
 [7, 8],
 [8, 9],
 [9, 10]]

chosen_atom_1 = [np.random.rand() for i in range(11)]

# use list of lists to define unique molecules
molecule_list = []

for i in pairs:
    temp_array = []
    for ii in pairs:
        temp_pair = [i[0], i[1]]

        if temp_pair[0] == ii[0]:
            temp_array.append(ii[1])
            temp_array = set(temp_array)
            temp_array = list(temp_array)
        if temp_pair[1] == ii[1]:
            temp_array.append(ii[0])
            temp_array = set(temp_array)
            temp_array = list(temp_array)

        for iii in temp_array:
            for j in pairs:
                if iii == j[0]:
                    temp_array.append(j[1])
                    temp_array = set(temp_array)
                    temp_array = list(temp_array)
                if iii == j[1]:
                    temp_array.append(j[0])
                    temp_array = set(temp_array)
                    temp_array = list(temp_array)

        if len(temp_array) > len(chosen_atom):
            break
    molecule_list.append(temp_array)

molecule_list = [list(item) for item in set(tuple(row) for row in molecule_list)]

# the output of pairs should be
molecule_list = [[8, 6, 7],
 [27, 28, 29],
 [9, 10, 11],
 [0, 1, 2],
 [32, 30, 31],
 [18, 19, 20],
 [45, 46, 47],
 [33, 34, 35],
 [24, 25, 26],
 [42, 43, 44],
 [16, 17, 15],
 [12, 13, 14],
 [21, 22, 23],
 [3, 4, 5],
 [40, 41, 39],
 [36, 37, 38]]
# the output of pairs_1 should be:
molecule_list = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]

So above I've given the outputs that I get now, and should get - any ideas to make this code nicer would be greatly appreciated.

n. m. could be an AI
  • 112,515
  • 14
  • 128
  • 243
  • 2
    This is probably a picture perfect question for [CodeReview.SO](https://codereview.stackexchange.com/), and less so here where the focus is more towards issues and problem solving :) – Torxed Feb 17 '19 at 10:51
  • Ah nice, will post it there - thanks! – Gianluca Cientanni Feb 17 '19 at 10:54
  • What makes you think that the code is "ugly" ? –  Feb 17 '19 at 11:05
  • It isn't quite clear what problem you have solved, or are trying to solve. "I have a list of nodes, and a list of the edges that connect the nodes". That's called a graph. " I need to find out the list of each unique molecule." The jump from graphs to molecules isn't quite straightforward. Your graph represents what? A single molecule? A collection of unconnected molecules? – n. m. could be an AI Feb 17 '19 at 11:05
  • 3
    Look for the connected components algorithm. – Mykola Zotko Feb 17 '19 at 11:06
  • 2
    You can solve it easily with networkx package and some knowledge of graph theory. This the field of chemoinformatics. – Mykola Zotko Feb 17 '19 at 11:08
  • So a closed graph represents a single molecule. From a list of nodes and edges I need to find the list of nodes that constitute a unique closed graph. For example, if I have a list of nodes [1,2,3,4,5] and they are connected like so [[1,2], [2,3], [3,5]] then my list of unique molecules is [[1,2,3,5],[4]]. – Gianluca Cientanni Feb 17 '19 at 11:10
  • https://stackoverflow.com/questions/54673308/how-to-merge-sets-which-have-intersections-connected-components-algorithm – Mykola Zotko Feb 17 '19 at 11:20
  • 1
    What you call "closed graph" seems to be a connected component. So this is indeed the connected components search and there is about a bajillion ready-made solutions in Python. – n. m. could be an AI Feb 17 '19 at 11:22
  • Possible duplicate of [Union of sublists with common elements](https://stackoverflow.com/questions/53886120/union-of-sublists-with-common-elements) – yatu Feb 17 '19 at 12:14

2 Answers2

3

As I said in comments you need the connected components algorithm and you can solve it easily with the networkx package:

import networkx as nx

G = nx.from_edgelist(pairs)

print([i for i in nx.connected_components(G)])

# jupyter notebook
%matplotlib inline
nx.draw(G, with_labels=True)

Output:

[{0, 1, 2}, {3, 4, 5}, {8, 6, 7}, {9, 10, 11}, {12, 13, 14}, {16, 17, 15}, {18, 19, 20}, {21, 22, 23}, {24, 25, 26}, {27, 28, 29}, {32, 30, 31}, {33, 34, 35}, {36, 37, 38}, {40, 41, 39}, {42, 43, 44}, {45, 46, 47}]

Graph

Here is my script to construct and visualize a molecular graph from atomic coordinates.

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
2

This can be solved using the "union-find" method.

Associate every atom a link to another atom in the same molecule. The link can be void. If not, it leads to another atom that has itself a link. Following the links recursively, you will eventually reach an atom with a void link. Let us call it the main atom in the molecule.

The algorithm works as follows:

  • take every edge in turn, let a-b;

  • find the main atom of a, let main(a);

  • link main(a) to b, to regroup them in the same molecule.

Doing this, main(a) becomes the same as main(b) and in the end all atoms of a given molecule have the same main.

After this first pass, you can perform a pass on the atoms to enumerate the mains, which correspond to the different molecules.

You can also add a third pass in which you reorganize the links so that every main will start a linked list with all the atoms of this molecule.


Variants and optimizations:

  • instead of linking main(a) to b, you can link it to main(b);

  • while you look for main(a), you can relink all atoms to main(a) on the way; this shortens the paths for future searches;

  • if you also store the length of the path from a to main(a), you can prefer to attach main(a) to main(b) or main(b) to main(a);

  • instead of letting a main atom link to void, you can let it link to itself.