Cluster list of lists based on shared values

Question

I have a list of lists where each sublist contains some integers:

o = [[1,2],[3,4],[2,3],[5,4]]

I'd like to create a new list of lists in which any two sublists in o that share a common member will be merged. This merging process should continue until no two sublists share a common element. Given o, we would merge [1,2] with [2,3] because they share a 2, then we'd merge that group with [3,4] because [1,2,3] and [3,4] both contain a 3, and so on.

The expected output of clustering o would be [[1,2,3,4,5]]

I have a hunch there exists an approach to this task that's far superior to my current approach (see below). Any suggestions others can offer on the most efficient (in time, then space) way to accomplish this task would be greatly appreciated.

from collections import defaultdict

o = [[1,2],[3,4],[2,3],[5,4]]

def group_lists(list_of_lists):
  '''
  Given a list of lists, continue combining sublist
  elements that share an element until no two sublist
  items share an element.
  '''
  to_cluster = set(tuple(i) for i in list_of_lists)
  keep_clustering = True
  while keep_clustering:
    keep_clustering = False
    d = defaultdict(set)
    for i in to_cluster:
      for j in i:
        d[j].add(i)
    clustered = set()
    for i in d.values():
      # remove duplicate entries from each cluster
      flat = tuple(set([item for sublist in i for item in sublist]))
      clustered.add(flat)
    if not to_cluster == clustered:
      keep_clustering = True
      to_cluster = clustered
  # done clustering!
  return clustered

print(group_lists(o))

score 3 · Answer 1 · answered May 26 '18 at 01:25

3

from collections import deque

o = [[1,2],[3,4],[2,3],[5,4], [10, 11, 12], [13, 15], [4,6], [6, 8], [23,25]]

o = sorted(o, key=lambda x:min(x))
queue = deque(o)

grouped = []
while len(queue) >= 2:
    l1 = queue.popleft()
    l2 = queue.popleft()
    s1 = set(l1)
    s2 = set(l2)

    if s1 & s2:
        queue.appendleft(s1 | s2)
    else:
        grouped.append(s1)
        queue.appendleft(s2)
#         print(set(l1).union(set(l2)))
if queue:
    grouped.append(queue.pop())

print(grouped)

output

[set([1, 2, 3, 4, 5, 6, 8]), set([10, 11, 12]), set([13, 15]), set([25, 23])]

answered May 26 '18 at 01:25

Asterisk

3,534
2
34
53

1

`o = [[1, 4], [2, 3], [4, 5]]` results to `[{1, 4}, {2, 3}, {4, 5}]` – niemmi May 26 '18 at 19:54
@niemmi You are right. It fails on this input. I guess a straightforward fix is to sort by max and see if the result differs from sorting my min. If so take the result with smaller number of items. Anyways, the graph based solution from thread you linked is much better. – Asterisk May 27 '18 at 02:07
@Asterisk Even with the change you mentioned this technique does not maximally merge groups. I'll post some tests you can run in that thread in just a few moments... – duhaime Jun 10 '18 at 14:06
1

@duhaime I agree. Thanks for letting me know. – Asterisk Jun 11 '18 at 03:23

Ajax1234 · Answer 2 · 2018-05-26T02:15:22.587

0

You can use recursion:

def cluster(d, current = []):
  options = [i for i in d if any(c in current for c in i)]
  _flattened = [i for b in options for i in b]
  d = list(filter(lambda x:x not in options, d))
  if not options or not d:
    yield current+_flattened
  if d and not options:
    yield from cluster(d[1:], d[0])
  elif d:
    yield from cluster(d, current+_flattened)

for a, *b in [[[1,2],[6,4],[2,3],[5,4]], [[1,2],[3,4],[2,3],[5,4]], [[1,2],[3,4],[2,3],[5,4], [10, 11, 12], [13, 15], [4,6], [6, 8], [23,25]]]:
  print([list(set(i)) for i in cluster(b, a)])

Output:

[[1, 2, 3], [4, 5, 6]]
[[1, 2, 3, 4, 5]]
[[1, 2, 3, 4, 5, 6, 8], [10, 11, 12], [13, 15], [25, 23]]

edited May 26 '18 at 02:15

answered May 25 '18 at 23:24

Ajax1234

69,937
8
61
102

this algorithm returns incorrect results. Try `o = [[1,2],[6,4],[2,3],[5,4]]`... – duhaime May 26 '18 at 01:28
@duhaime Please see my recent edit. – Ajax1234 May 26 '18 at 02:17

Cluster list of lists based on shared values

2 Answers2