
I have an algorithm that runs over a set of objects. The algorithm produces a score value that reflects the differences between the elements in the set.

The sorted output is something like this:

[1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]

If you lay these values down on a spreadsheet you see that they make up groups

[1,1,5,6,1,5] [10,22,23,23] [50,51,51,52] [100,112,130] [500,512,600] [12000,12230]

Is there a way to programmatically get those groupings?

Maybe some clustering algorithm using a machine learning library? Or am I overthinking this?

I've looked at scikit but their examples are way too advanced for my problem...

Has QUIT--Anony-Mousse
PCoelho

3 Answers


Don't use clustering for 1-dimensional data

Clustering algorithms are designed for multivariate data. When you have 1-dimensional data, sort it, and look for the largest gaps. This is trivial and fast in 1d, and not possible in 2d. If you want something more advanced, use Kernel Density Estimation (KDE) and look for local minima to split the data set.
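The sort-and-gap approach can be sketched in a few lines of numpy (`split_on_gaps` is a hypothetical helper, not a library function). Note that cutting at the largest absolute gaps groups the data much like the MeanShift answer below rather than like the hand-picked groups in the question, because the gap between 12000 and 12230 is large in absolute terms:

```python
import numpy as np

def split_on_gaps(values, n_groups):
    # hypothetical helper: split sorted 1-d data at its largest gaps
    values = np.sort(np.asarray(values))
    gaps = np.diff(values)
    # positions of the n_groups - 1 largest gaps, in ascending index order
    cut_points = np.sort(np.argsort(gaps)[-(n_groups - 1):]) + 1
    return [chunk.tolist() for chunk in np.split(values, cut_points)]

data = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
groups = split_on_gaps(data, 6)
print(groups)
# [[1, 1, 1, 5, 5, 6, 10, 22, 23, 23, 50, 51, 51, 52],
#  [100, 112, 130], [500, 512], [600], [12000], [12230]]
```

If the data spans several orders of magnitude, as here, taking gaps on a log scale before cutting may match intuition better.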

There are a number of duplicates of this question.

Has QUIT--Anony-Mousse

A good option if you don't know the number of clusters is MeanShift:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

x = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]

# pair each value with a zero so the data is 2-d; in Python 3, zip must be
# materialized with list() before passing it to np.array
X = np.array(list(zip(x, np.zeros(len(x)))), dtype=int)
bandwidth = estimate_bandwidth(X, quantile=0.1)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

for k in range(n_clusters_):
    my_members = labels == k
    print("cluster {0}: {1}".format(k, X[my_members, 0]))

Output for this algorithm:

cluster 0: [ 1  1  5  6  1  5 10 22 23 23 50 51 51 52]
cluster 1: [100 112 130]
cluster 2: [500 512]
cluster 3: [12000]
cluster 4: [12230]
cluster 5: [600]

By modifying the `quantile` variable you can change how the number of clusters is selected.

jabaldonedo
    The first argument of `np.array` needs to be `list(zip(x,np.zeros(len(x))))`. Otherwise, Python throws an error: _TypeError: int() argument must be a string, a bytes-like object or a number, not 'zip'_ – Logan Jun 14 '17 at 23:54
    This approach may not work very well for some inputs that aren't easily "clusterable", e.g. `x = [90, 100, 110]`. It will then fail with `ValueError: Expected n_neighbors > 0. Got 0` (which can be avoided with parameter tuning). For such inputs https://stackoverflow.com/a/18385795/942774 is probably the much simpler and better answer. – Hendrik Nov 09 '17 at 10:11
  • I replaced `X = np.array(zip(x,np.zeros(len(x))), dtype=np.int)` with `X = np.array(x).reshape(-1, 1)` and it works perfectly. – Tarik Jun 13 '23 at 06:06
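Putting the fixes from the comments together, a Python 3 sketch that uses a single-column array (the `reshape(-1, 1)` form from the last comment) instead of the zip-with-zeros trick; the exact clusters depend on the estimated bandwidth:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

x = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
X = np.array(x).reshape(-1, 1)  # shape (22, 1): one feature per sample

bandwidth = estimate_bandwidth(X, quantile=0.1)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)

for k in np.unique(ms.labels_):
    print("cluster {}: {}".format(k, X[ms.labels_ == k, 0]))
```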

You can use clustering to group these. The trick is to understand that there are two dimensions to your data: the dimension you can see, and the "spatial" dimension that looks like [1, 2, 3... 22]. You can create this matrix in numpy like so:

import numpy as np

y = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
x = np.arange(len(y))          # the "spatial" dimension: 0, 1, ..., 21
m = np.column_stack([x, y])    # one (index, value) row per data point

Then you can perform clustering on the matrix, with:

from scipy.cluster.vq import kmeans
kclust = kmeans(m, 5)

kclust's output will look like this:

(array([[   11,    51],
       [   15,   114],
       [   20, 12115],
       [    4,     9],
       [   18,   537]]), 21.545126372346271)

For you, the most interesting part is the first column of the matrix, which says what the centers are along that x dimension:

cluster_indices = kclust[0][:, 0]
# [20 18 15  4 11]

You can then assign your points to a cluster based on which of the five centers they are closest to:

assigned_clusters = [abs(cluster_indices - e).argmin() for e in x]
# [3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1, 0, 0, 0]
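As one comment notes, scipy's `kmeans2` returns both centroids and labels, so the manual nearest-center assignment can be skipped. A sketch (the column-stacked array is equivalent to the matrix built above):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

y = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
m = np.column_stack([np.arange(len(y)), y]).astype(float)

np.random.seed(0)  # kmeans2 initialization is random
centroids, labels = kmeans2(m, 5, minit='points')
print(labels)  # one cluster id per data point
```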
oz123
David Robinson
  • an updated function kmeans2 (in scipy.cluster.vq) now outputs both centroid and label, e.g. `kclust, label = kmeans(m, 5)` – Sean Apr 14 '16 at 18:07
  • Hi, The code does not work. Error in first line for obvious reasons. The last line also produces an error, 'cluster_indices' not defined. Can you please help to get this code running ? – le_llama May 31 '16 at 14:36
  • @gprakhar Use `cluster_indices = kclust[0][:, 0]`. – joost Jul 19 '16 at 12:26
  • I think last line should be assigned_clusters = [abs(kclust[0][:, 0]- e).argmin() for e in y] N.b. second to last character is changed to y – Isaac Jan 12 '22 at 15:35