How to cluster items in a list without specifying the number of clusters

Question

My goal is to cluster the items in the list below based on distances between any two consecutive items. The number of clusters is not specified, only the maximum distance between any two consecutive items that must not be exceeded for them to be in the same cluster.

My attempt

import itertools
max_dist=20
_list=[1,48,52,59,89,94,103,147,151,165]
Ideal_result= [[1],[48,52,59],[89,94,103],[147,151,165]]

def clust(list_x, max_dist):
    q=[]
    for a, b in itertools.combinations(list_x,2):
        if b-a<=20:
            q.append(a),(b)
        else:continue
        yield q
print list(clust(_list,max_dist))

Output:

[[48,48,52,89,89,94,147,147,151],[48,48,52,89,89,94,147,147,151],..]`

The output is completely wrong, but I just wanted to include my attempt.

Any suggestions on how to get the ideal result? Thanks.

@Two-BitAlchemist, not necessarily the same size. What matters is the distance between any two consecutive items; starting from the first item in the list (in ascension). See the ideal result for more details. — Tiger1, Mar 16 '14 at 09:49
NumPy has some hierarchical clustering routines: http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html, but this may be overkill. — jscs, Mar 16 '14 at 23:24
@JoshCaswell, thanks for the suggestion. I will look it up straight away. — Tiger1, Mar 16 '14 at 23:46
possible duplicate of [1D Number Array Clustering](http://stackoverflow.com/questions/11513484/1d-number-array-clustering) — Has QUIT--Anony-Mousse, Mar 18 '14 at 01:29

score 2 · Accepted Answer · answered Mar 16 '14 at 23:47

This passes your test:

def cluster(items, key_func):
    items = sorted(items)
    clusters = [[items[0]]]
    for item in items[1:]:
        cluster = clusters[-1]
        last_item = cluster[-1]
        if key_func(item, last_item):
            cluster.append(item)
        else:
            clusters.append([item])
    return clusters

Where key_func returns True if the current and previous items should be part of the same cluster:

>>> cluster([1,48,52,59,89,94,103,147,151,165], lambda curr, prev: curr-prev < 20)
[[1], [48, 52, 59], [89, 94, 103], [147, 151, 165]]

Another possibility would be to modify the "equivalent code" for itertools.groupby() to likewise take more than one argument for the key function.

How to cluster items in a list without specifying the number of clusters

1 Answers1