0

I have a list of lists where each sublist contains some integers:

o = [[1,2],[3,4],[2,3],[5,4]]

I'd like to create a new list of lists in which any two sublists in o that share a common member will be merged. This merging process should continue until no two sublists share a common element. Given o, we would merge [1,2] with [2,3] because they share a 2, then we'd merge that group with [3,4] because [1,2,3] and [3,4] both contain a 3, and so on.

The expected output of clustering o would be [[1,2,3,4,5]]

I have a hunch there exists an approach to this task that's far superior to my current approach (see below). Any suggestions others can offer on the most efficient (in time, then space) way to accomplish this task would be greatly appreciated.

from collections import defaultdict

o = [[1,2],[3,4],[2,3],[5,4]]

def group_lists(list_of_lists):
  '''
  Given a list of lists, continue combining sublist
  elements that share an element until no two sublist
  items share an element.
  '''
  to_cluster = set(tuple(i) for i in list_of_lists)
  keep_clustering = True
  while keep_clustering:
    keep_clustering = False
    d = defaultdict(set)
    for i in to_cluster:
      for j in i:
        d[j].add(i)
    clustered = set()
    for i in d.values():
      # remove duplicate entries from each cluster
      flat = tuple(set([item for sublist in i for item in sublist]))
      clustered.add(flat)
    if not to_cluster == clustered:
      keep_clustering = True
      to_cluster = clustered
  # done clustering!
  return clustered

print(group_lists(o))
martineau
  • 119,623
  • 25
  • 170
  • 301
duhaime
  • 25,611
  • 17
  • 169
  • 224

2 Answers2

3
from collections import deque

o = [[1,2],[3,4],[2,3],[5,4], [10, 11, 12], [13, 15], [4,6], [6, 8], [23,25]]

o = sorted(o, key=lambda x:min(x))
queue = deque(o)

grouped = []
while len(queue) >= 2:
    l1 = queue.popleft()
    l2 = queue.popleft()
    s1 = set(l1)
    s2 = set(l2)

    if s1 & s2:
        queue.appendleft(s1 | s2)
    else:
        grouped.append(s1)
        queue.appendleft(s2)
#         print(set(l1).union(set(l2)))
if queue:
    grouped.append(queue.pop())

print(grouped)

output

[set([1, 2, 3, 4, 5, 6, 8]), set([10, 11, 12]), set([13, 15]), set([25, 23])]
Asterisk
  • 3,534
  • 2
  • 34
  • 53
  • 1
    `o = [[1, 4], [2, 3], [4, 5]]` results to `[{1, 4}, {2, 3}, {4, 5}]` – niemmi May 26 '18 at 19:54
  • @niemmi You are right. It fails on this input. I guess a straightforward fix is to sort by max and see if the result differs from sorting my min. If so take the result with smaller number of items. Anyways, the graph based solution from thread you linked is much better. – Asterisk May 27 '18 at 02:07
  • @Asterisk Even with the change you mentioned this technique does not maximally merge groups. I'll post some tests you can run in that thread in just a few moments... – duhaime Jun 10 '18 at 14:06
  • 1
    @duhaime I agree. Thanks for letting me know. – Asterisk Jun 11 '18 at 03:23
0

You can use recursion:

def cluster(d, current = []):
  options = [i for i in d if any(c in current for c in i)]
  _flattened = [i for b in options for i in b]
  d = list(filter(lambda x:x not in options, d))
  if not options or not d:
    yield current+_flattened
  if d and not options:
    yield from cluster(d[1:], d[0])
  elif d:
    yield from cluster(d, current+_flattened)

for a, *b in [[[1,2],[6,4],[2,3],[5,4]], [[1,2],[3,4],[2,3],[5,4]], [[1,2],[3,4],[2,3],[5,4], [10, 11, 12], [13, 15], [4,6], [6, 8], [23,25]]]:
  print([list(set(i)) for i in cluster(b, a)])

Output:

[[1, 2, 3], [4, 5, 6]]
[[1, 2, 3, 4, 5]]
[[1, 2, 3, 4, 5, 6, 8], [10, 11, 12], [13, 15], [25, 23]]
Ajax1234
  • 69,937
  • 8
  • 61
  • 102