0

Given a list of sets (sets of strings such as setlist = [{'this','is'},{'is','a'},{'test'}]), the idea is to join pairwise -union- sets that share strings in common. The snippet below takes the literal approach of testing pairwise overlap, joining, and starting anew using an inner loop break.

I know this is the pedestrian approach, and it does take forever for lists of usable size (200K sets of between 2 and 10 strings).

Any advice on how to make this more efficient? Thanks.

j    = 0
while True:
    if j == len(setlist): # both for loops are done
        break # while
    for i in range(0,len(setlist)-1):
        for j in range(i+1,len(setlist)):
            a = setlist[i];
            b = setlist[j];
            if not set(a).isdisjoint(b):     # ... then join them
                newset = set.union( a , b )  # ... new set
                del setlist[j]            # ... drop highest index
                del setlist[i]            # ... drop lowest index
                setlist.insert(0,newset)  # ... introduce consolidated set, which messes up i,j
                break                        # ... back to the top for fresh i,j
        else:
            continue
        break
user2105469
  • 1,413
  • 3
  • 20
  • 37
  • What is `setlist`? – Huy Vo Feb 12 '17 at 01:51
  • I've edited the question and added a setlist example, with the output `[{'this','is','a'},{'test'}]` expected. – user2105469 Feb 12 '17 at 01:53
  • So your expected output is "this is a test"? – Huy Vo Feb 12 '17 at 01:54
  • https://en.wikipedia.org/wiki/Connected_component_(graph_theory)#Algorithms – user2357112 Feb 12 '17 at 01:55
  • See [here](http://stackoverflow.com/questions/9110837/python-simple-list-merging-based-on-intersections) for a bunch of timing comparisons and [here](http://stackoverflow.com/questions/20154368/union-find-implementation-using-python) for a potential O(n) algorithm (I haven't checked the logic myself.) – DSM Feb 12 '17 at 02:18

1 Answers1

2

As @user2357112 mentioned in comments this can be thought of as a graph problem. Every set is a vertex and every word shared between two sets is an edge. Then you can just iterate over vertices and do BFS (or DFS) for every unseen vertex to generate a connected component.

Other option is to use Union-Find. The advantage of the union find is that you don't need to construct a graph and there's no degenerate case when all the sets have same contents. Here's an example of it in action:

from collections import defaultdict

# Return ancestor of given node
def ancestor(parent, node):
    if parent[node] != node:
        # Do path compression
        parent[node] = ancestor(parent, parent[node])

    return parent[node]

def merge(parent, rank, x, y):
    # Merge sets that x & y belong to
    x = ancestor(parent, x)
    y = ancestor(parent, y)

    if x == y:
        return

    # Union by rank, merge smaller set to larger one
    if rank[y] > rank[x]:
        x, y = y, x

    parent[y] = x
    rank[x] += rank[y]

def merge_union(setlist):
    # For every word in sets list what sets contain it
    words = defaultdict(list)

    for i, s in enumerate(setlist):
        for w in s:
            words[w].append(i)

    # Merge sets that share the word
    parent = list(range(len(setlist)))
    rank = [1] * len(setlist)
    for sets in words.values():
        it = iter(sets)
        merge_to = next(it)
        for x in it:
            merge(parent, rank, merge_to, x)

    # Construct result by union the sets within a component
    result = defaultdict(set)
    for merge_from, merge_to in enumerate(parent):
        result[merge_to] |= setlist[merge_from]

    return list(result.values())

setlist = [
    {'this', 'is'},
    {'is', 'a'},
    {'test'},
    {'foo'},
    {'foobar', 'foo'},
    {'foobar', 'bar'},
    {'alone'}
]

print(merge_union(setlist))

Output:

[{'this', 'is', 'a'}, {'test'}, {'bar', 'foobar', 'foo'}, {'alone'}]
niemmi
  • 17,113
  • 7
  • 35
  • 42