2

My goal is to cluster the items in the list below based on distances between any two consecutive items. The number of clusters is not specified, only the maximum distance between any two consecutive items that must not be exceeded for them to be in the same cluster.

My attempt

import itertools
max_dist=20
_list=[1,48,52,59,89,94,103,147,151,165]
Ideal_result= [[1],[48,52,59],[89,94,103],[147,151,165]]

def clust(list_x, max_dist):
    q=[]
    for a, b in itertools.combinations(list_x,2):
        if b-a<=20:
            q.append(a),(b)
        else:continue
        yield q
print list(clust(_list,max_dist))

Output:

[[48,48,52,89,89,94,147,147,151],[48,48,52,89,89,94,147,147,151],..]`

The output is completely wrong, but I just wanted to include my attempt.

Any suggestions on how to get the ideal result? Thanks.

thefourtheye
  • 233,700
  • 52
  • 457
  • 497
Tiger1
  • 1,327
  • 5
  • 19
  • 40
  • Do the clusters all have to be the same size? – Two-Bit Alchemist Mar 16 '14 at 09:40
  • @Two-BitAlchemist, not necessarily the same size. What matters is the distance between any two consecutive items; starting from the first item in the list (in ascension). See the ideal result for more details. – Tiger1 Mar 16 '14 at 09:49
  • 1
    NumPy has some hierarchical clustering routines: http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html, but this may be overkill. – jscs Mar 16 '14 at 23:24
  • @JoshCaswell, thanks for the suggestion. I will look it up straight away. – Tiger1 Mar 16 '14 at 23:46
  • possible duplicate of [1D Number Array Clustering](http://stackoverflow.com/questions/11513484/1d-number-array-clustering) – Has QUIT--Anony-Mousse Mar 18 '14 at 01:29

1 Answers1

2

This passes your test:

def cluster(items, key_func):
    items = sorted(items)
    clusters = [[items[0]]]
    for item in items[1:]:
        cluster = clusters[-1]
        last_item = cluster[-1]
        if key_func(item, last_item):
            cluster.append(item)
        else:
            clusters.append([item])
    return clusters

Where key_func returns True if the current and previous items should be part of the same cluster:

>>> cluster([1,48,52,59,89,94,103,147,151,165], lambda curr, prev: curr-prev < 20)
[[1], [48, 52, 59], [89, 94, 103], [147, 151, 165]]

Another possibility would be to modify the "equivalent code" for itertools.groupby() to likewise take more than one argument for the key function.

jscs
  • 63,694
  • 13
  • 151
  • 195