1

I have a list of hundreds of 2-tuples:

tuples = [('foo', 'bar'), ('hi', 'bar'), ('hi', 'bye'),
          ('dddd', 'cccc'), ('bbbb', 'cccc'), ('aaaa', 'xxxx') ... ]

My aim is to build clusters: every time an element appears in a tuple, it is similar to the elements of this tuple and to all the elements similar to the elements of this tuple. So I want it to be recursive.

With this example, "foo" is similar to "bar" and bar appears with "hi", so we add "hi", and then "hi" appears with "bye", so we add "bye", etc:

clusters = [('foo', 'bar', 'hi', 'bye'),
            ('dddd', 'cccc', 'bbbb'),
            ('aaaa', 'xxxx')]

Is there a good algorithm for that? Thanks!

4 Answers4

2

See comments.

def find_clusters( tuples ):
    # clusterlist contains at each position either a set
    # representing an actual cluster, or an int referring
    # to another cluster that has eaten this one here.
    # the cluster id is its position within this list
    clusterlist=[]
    # clustermap maps an element to the id of the containing
    # cluster within clusterlist
    clustermap = {}

    # here we find the current cluster id for elem, by following the
    # chain within clusterlist, and replace that entire chain
    # with the new cluster id n.   We return the old cluster id.
    def set_cluster_id( elem, n ):
        if elem not in clustermap:
            return None
        k = clustermap[elem]
        # clusters may be replaced by references to other clusters,
        # we follow that chain
        while k < n and isinstance( clusterlist[k], int ):
            k1 = clusterlist[k]
            # this is optional, we make the chain shorter
            # by making this entry point directly to the current cluster
            clusterlist[k] = n
            k = k1
        return k

    for t in tuples:
        # for each tuple we create a new cluster
        thiscluster = set(t)
        n = len( clusterlist ) # the id of thiscluster
        for x in t:
            # we absorb existing clusters into the new one
            # if there is overlap
            k = set_cluster_id(x, n)
            if k is not None and k != n:
                thiscluster.update( clusterlist[k] )
                # we replace the existing cluster
                # with a reference to the new one
                clusterlist[k] = n 
            clustermap[x] = n
        clusterlist.append(thiscluster)

    return [ tuple(x) for x in clusterlist if isinstance( x, set ) ]

print( find_clusters( [('foo', 'bar'), ('hi', 'bar'), ('hi', 'bye'),
          ('dddd', 'cccc'), ('bbbb', 'cccc'), ('aaaa', 'xxxx'), ('aaaa', 'xxxx')] ) )

EDIT: I fixed a small performance issue, now the performance of this should be linear with the combined number of all elements in all tuples.

pentadecagon
  • 4,717
  • 2
  • 18
  • 26
0

If I have understood correctly, then this is a very well-known problem, of finding connected components (graphs) in a forest (given a series of pairs of connected nodes).

It can be solved in almost O(n) complexity with the Weighted Quick Union Find with path compression algorithm. http://algs4.cs.princeton.edu/15uf/

Abhishek Bansal
  • 12,589
  • 4
  • 31
  • 46
  • Looks like it, and then the question is duplicate of [this one](http://stackoverflow.com/questions/3067529/looking-for-a-set-union-find-algorithm) – Ricardo Cárdenes Feb 08 '14 at 13:55
-1

You could use this algorithm (although maybe not the quickest). You can use sets for clusters instead of lists in case you fear the elements may get duplicated (for example [('foo', 'bar'), ('foo', 'baz'), ('bar', 'baz')]).

clusters = []
# browse the pairs
for elem1, elem2 in tuples:
    # Try to find a cluster that ends with elem1
    for cluster in clusters:
        # Add elem2 to the chain
        if elem1 in cluster:
            cluster.append(elem2)
            break
        if elem2 in cluster:
            cluster.append(elem1)
            break
    # If we found nothing, create a new cluster
    else:
        # We use lists for clusters because tuples are immutable
        clusters.append([elem1, elem2])
# Optional: convert clusters to tuples
clusters = list(map(tuple, clusters))
Cilyan
  • 7,883
  • 1
  • 29
  • 37
-1

You asked for a "good algorithm", so I suppose one has a better time/space complexity than others. However, note that if you are particularly concerned about speed, you might want to be using a lower-level language like C++ so you have more control over the data structures.

However, I thought of a reasonably Pythonic way of doing it:

tuples = [('foo', 'bar'), ('hi', 'bar'), ('hi', 'bye'), ('dddd', 'cccc'), ('bbbb', 'cccc'), ('aaaa', 'xxxx')]

elementLocations = dict() # hashtable of which cluster each element is in
clusters = []

for tuple in tuples:
    if tuple[0] in elementLocations: # hopefully O(1)
        clusters[elementLocations[tuple[0]]].add(tuple[1])
    elif tuple[1] in elementLocations:
        clusters[elementLocations[tuple[1]]].add(tuple[0])
    else: # create new cluster
        clusters.append(set(tuple))
        elementLocations[tuple[0]] = len(clusters) - 1
        elementLocations[tuple[1]] = len(clusters) - 1

clusters = map(tuple, clusters) # to achieve the output as described in the question
print clusters

>>> 
[['hi', 'foo', 'bar'], ['bye', 'hi'], ['cccc', 'bbbb', 'dddd'], ['aaaa', 'xxxx']]

As far as I can see, this is O(n), assuming Python's dict and set classes are working properly.

Note that if you are not concerned with duplicates in your clusters, you wouldn't need to use python sets.

icedtrees
  • 6,134
  • 5
  • 25
  • 35