Find rows which share common elements

Question

I have a 2D numpy array in which each row contains 2 integers. I want to find all the groups of elements belonging to rows which share common elements (a.k.a connected components of a graph from the edgelist array). For example, for the array:

[[ 0  4]
 [ 0  7]
 [ 1  2]
 [ 1 13]
 [ 2  1]
 [ 2  9]
 [ 3 14]
 [ 3 16]
 [ 4  0]
 [ 4  5]
 [ 5  4]
 [ 5  6]
 [ 6  5]
 [ 6  7]
 [ 7  0]
 [ 7  6]]

would contain the groups

[[ 0  4  5  6  7]
 [ 1  2 13  9]
 [ 3 14 16]]

Are you asking: given a list of edges how to find the connected components? Are your edges directed? Please edit the question to clarify. — myrtlecat, Sep 25 '20 at 23:01
I'm afraid I'm not too aware of graph theory terminology, however, I believe Ehsan has edited the question to clarify what I was saying. — bs15ansj, Sep 27 '20 at 10:03

Ehsan · Answer 1 · 2020-09-25T23:56:27.287

2

If you can use libraries, assuming your array is a (Note that you cannot have components as numpy array since they can be non-rectangular array which does not exist in numpy, so this outputs them as sets):

import networkx as nx
G=nx.Graph()
G.add_edges_from(a)
print(sorted(nx.connected_components(G), key = len, reverse=True))
#[{0, 4, 5, 6, 7}, {1, 2, 13, 9}, {16, 3, 14}]

And if you need a pure numpyic solution without extra libraries, please check out this generalized solution: https://stackoverflow.com/a/61764414/4975981

edited Sep 25 '20 at 23:56

answered Sep 25 '20 at 23:53

Ehsan

12,072
2
20
33

1

This is using [graph theory connect components](https://en.wikipedia.org/wiki/Component_(graph_theory)) – Scott Boston Sep 25 '20 at 23:55
@Ehsan Don't know fully a meaning of `numpyic` but there's no vectorized solution in your reference. I've managed to make it in `C-level` (almost) but still wondering if pure vectorized `numpy` version exists there. – mathfux Sep 26 '20 at 02:11
Thanks for the suggestion, clearly the most simple response but yes, I was hoping for a pure numpy answer. The best I could get myself was iterating through rows and if the elements are not already found using `A[np.where(np.isin(A,row))[0]]` then adding to a list. However, since the edges between nodes are form pentameric connected graphs, the fifth node (two away from the nodes in the row) would always be left out and the above line would need to be called again on the four found nodes of the set. Sorry if my terminology is incorrect, I'm a biologist working on viral capsids! – bs15ansj Sep 27 '20 at 10:14
@mathfux Although the linked answer uses vectorized approach to some extent, I cannot see where I claimed it to be. I simply referred to it as a solution that uses only numpy(hence Numpyic to some extent) and not an external library. I am not aware of an efficient solution that is fully vectorized. Thank you for your answer post. I suspect igraph uses DFS to find clusters. – Ehsan Sep 28 '20 at 18:17
@bs15ansj You are welcome. I will add the linked answer here, although I am not sure if it would be faster than networkx. It also depends on how dense your graph is. Is your graph considered sparse (meaning the number of rows in your array is of the same order as the maximum number of unique elements in your array)? – Ehsan Sep 28 '20 at 18:19

score 1 · Answer 2 · answered Sep 25 '20 at 23:37

I am sure there are faster ways and I do not study graph theory but you can start with this;

x = [[ 0,  4],
 [ 0,  7],
 [ 1,  2],
 [ 1, 13],
 [ 2,  1],
 [ 2,  9],
 [ 3, 14],
 [ 3, 16],
 [ 4,  0],
 [ 4,  5],
 [ 5,  4],
 [ 5,  6],
 [ 6,  5],
 [ 6,  7],
 [ 7,  0],
 [ 7,  6]]

nodes = list(set([item for sublist in x for item in sublist]))
grps = {n: g for n, g in zip(nodes, range(len(nodes)))}

for v in x:
    t = grps[v[0]]
    f = grps[v[1]]
    if t != f:
        for n in grps:
            if grps[n] == f:
                grps[n] = t

ret = [[k for k, v in grps.items() if v==g] for g in set(grps.values())]
print(ret)

score 1 · Answer 3 · answered Sep 26 '20 at 01:42

In terms of graph theory you need to create a graph from an array of edges and then find connected components of this graph. Pure numpy based solution seems too hard for me but you still can make it C level using igraph which is written in C (unlike networkx which is pure Python). You need to install python-igraph first.

Trivial case

igraph.Graph.clusters() method returns a special instance of igraph.clustering.VertexClustering class which can be converted to list:

import igraph
arr = np.array([[0, 4], [0, 7], [1, 2], [1, 9], [2, 1], [2, 8], [3, 10], 
                [3, 11], [4, 0], [4, 5], [5, 4], [5, 6], [6, 5], [6, 7], [7, 0], [7, 6]])
g = ig.Graph()
g.add_vertices(12)
g.add_edges(arr)
i = g.clusters() 
print(list(i))
#output: [[0, 4, 5, 6, 7], [1, 2, 8, 9], [3, 10, 11]]

igraph also supports plotting these connected components like it's done in networkx but you might need to download pycairo from unofficial binaries and install it in order to unlock igraph.plot option:

pal = ig.drawing.colors.ClusterColoringPalette(len(i)) #passing a number of colors 
color = pal.get_many(i.membership) # a list of color codes for each vertex
ig.plot(g,  bbox = (200, 100), vertex_label=g.vs.indices,
        vertex_color = color, vertex_size = 12, vertex_label_size = 8)

General case

Notice that igraph throws an InternalError if initial array is used instead of trivial one. That's because every vertex should be declared before adding edges and all the vertices are not allowed to have missing numbers (in fact, it's allowed but reindexation is done silently and old names can be accessed using 'name' attribute). This issue can be fixed writting a custom function that creates a graph from relabelled edges:

def create_from_edges(edgelist):
    g = ig.Graph()
    u, inv = np.unique(edgelist, return_inverse=True)
    e = inv.reshape(edgelist.shape)
    g.add_vertices(u) #add vertices, not reindexed
    g.add_edges(e) #add edges, reindexed
    return g

arr = np.array([[0, 4], [0, 7], [1, 2], [1, 13], [2, 1], [2, 9], [3, 14], 
                [3, 16], [4, 0], [4, 5], [5, 4], [5, 6], [6, 5], [6, 7], [7, 0], [7, 6]])
g = create_from_edges(arr)
i = g.clusters()
print(list(i))
#output: [[0, 4, 5, 6, 7], [1, 2, 8, 9], [3, 10, 11]]

New labels were used in output (thus making it incorrect) but it's still possible to access old ones like so:

print('new_names:', g.vs.indices)
print('old_names:', g.vs['name'])
>>> new_names: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> old_names: [0, 1, 2, 3, 4, 5, 6, 7, 9, 13, 14, 16]

They can be used to preview original graph (vertex_label is different now):

pal = ig.drawing.colors.ClusterColoringPalette(len(i)) #passing a number of colors 
color = pal.get_many(i.membership) ##a list of color codes for each vertex
ig.plot(g,  bbox = (200, 100), vertex_label=g.vs['name'], 
        vertex_color = color, vertex_size = 12, vertex_label_size = 8)

Finally, you need to use old names of vertices in order to fix output. It can be done like so:

output = list(i)
old_names = np.array(g.vs['name'])
fixed_output = [old_names[n].tolist() for n in output]
#new output: [[0, 4, 5, 6, 7], [1, 2, 9, 13], [3, 14, 16]]

Find rows which share common elements

3 Answers3

Trivial case

General case