3

I'm looking for an efficient connection grouping (I'm not sure this is proper name..) algorithm or implementation by python.

For example, I have this nested list:

connection_data = [
   ...:     ["A", "B", "C"],
   ...:     ["B", "D"],
   ...:     ["A", "C"],
   ...:     ["E", "F"],
   ...:     ["C", "D"],
   ...:     ]

This data means each list in the nested list shows connections. For example, the first connection ["A", "B", "C"] means A and B and C are having connection each other. The nested list has multiple connections information.

I would like to calculate connection groupings from a nested list. For example, when I have the upper connection_data, I would like to get

grouped_connection = [
   ...:     ["A", "B", "C", "D"],
   ...:     ["E", "F"],
   ...:     ]

Because, A, B, C, D have connections in these connection data in the connection_data: ["A", "B", "C"], ["B", "D"], ["A", "C"], ["C", "D"], E and F have connection by ["E", "F"].

To summarize my questions:

  1. What is this type of problem called in general?
  2. I think I can implement a many for-loop based solver. But is there any efficient algorithm or implementation in python for this type of problem?
Atsushi Sakai
  • 938
  • 10
  • 22
  • 2
    I think this is relevant: [https://stackoverflow.com/questions/20154368/union-find-implementation-using-python](https://stackoverflow.com/questions/20154368/union-find-implementation-using-python) – Shlomo Gottlieb Jan 29 '22 at 15:32

2 Answers2

4

Networkx provides an implementation of a union find datastructure [1] [2] that solves this problem efficiently:

from networkx.utils.union_find import UnionFind

groups = [
    ["A", "B", "C"],
    ["B", "D"],
    ["A", "C"],
    ["E", "F"],
    ["C", "D"],
]

ds = UnionFind()
for gp in groups:
  ds.union(*gp)
for s in ds.to_sets():
  print(s)

# {'B', 'C', 'D', 'A'}
# {'E', 'F'}
hilberts_drinking_problem
  • 11,322
  • 3
  • 22
  • 51
2

Note: This answer is actually slower than the union-find algorithm given by hilberts_drinking_problem. If all of the input connections are pairs (i.e. of size 2), the runtime of both algorithms is essentially the same. However, this is not the case for OP's question. See the comments for more details.


You can construct a graph where the nodes are lettered A, B, C ..., and place an undirected, unweighted edge between two nodes if the initial groupings dictate that they should be in the same group. Then, the final groups are the connected components of the constructed graph. (This is the closest thing to what your problem is generally called.)

This can be done using a graph traversal algorithm, such as BFS or DFS, but if you don't want to code this up by hand, networkx can take care of everything once you've made the graph. You need to do a little bit of preprocessing on the groupings since some of them are of size greater than two, but otherwise networkx is well-suited for this problem:

from itertools import combinations
import networkx as nx

groups = [
    ["A", "B", "C"],
    ["B", "D"],
    ["A", "C"],
    ["E", "F"],
    ["C", "D"],
]

G = nx.Graph()

# Handle initial groupings of size greater than two by iterating over
# each possible pair of letters in the group.
for group in groups:
    G.add_edges_from(combinations(sorted(group), 2))

# Result should look something like [['B', 'C', 'A', 'D'], ['F', 'E']],
# but the ordering may be nondeterministic.
print(list(list(group) for group in nx.connected_components(G)))
BrokenBenchmark
  • 18,126
  • 7
  • 21
  • 33
  • I suggest `from itertools import combinations` then you can replace your triple loop with `for group in groups: G.add_edges_from(combinations(sorted(group), 2))` – Stef Jan 29 '22 at 10:24
  • Also note that you can write `nx.connected_components` directly without knowing that it's in `nx.algorithms.components`. This has three advantages: you don't need to know where it is exactly; it's shorter and more readable; if in a future version, networkx is organized differently, your code won't break. – Stef Jan 29 '22 at 10:28
  • Also, personally when I read answers on StackOverflow, I really like to be able to see the output. So in this case I suggest adding the result `# [['E', 'F'], ['B', 'A', 'C', 'D']]` just below the final `print` statement. – Stef Jan 29 '22 at 10:32
  • 1
    I think union find is _the_ way to approach this problem, although [this](https://stackoverflow.com/questions/28398101/union-find-or-dfs-which-one-is-better-to-find-connected-component) reference provides some relevant discussion. – hilberts_drinking_problem Jan 29 '22 at 10:51
  • @Stef Revised answer. – BrokenBenchmark Jan 29 '22 at 19:12
  • @hilberts_drinking_problem IMO explaining it in terms of a graph is easier because you can easily break the problem down in terms of nodes, edges, and a standard graph algorithm. I don't think there's anything wrong with union-find though. (Someone in the discussion you linked noted that `union()` isn't an `O(1)` operation, but the inverse Ackermann function is bounded by a small constant in practice.) – BrokenBenchmark Jan 29 '22 at 19:18
  • @BrokenBenchmark The graph version has complexity proportional to the sum of squares of sizes of the groups (because a group of size k results in k(k-1)/2 edges). The union-find version has complexity proportional to the sum of sizes of the groups. – Stef Jan 30 '22 at 15:28
  • @Stef Ah, fair point. Normally the problem is formulated in a way where all the groups are of size 2, so this issue doesn't usually crop up. – BrokenBenchmark Jan 30 '22 at 16:27
  • I also added a clear note indicating that union-find is faster in this case. – BrokenBenchmark Jan 30 '22 at 16:30