
I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The current method used by the system I'm on is K-means, but that seems like overkill.

Is there a better way of performing this task?

Answers to some other posts mention KDE (Kernel Density Estimation), but that is a density estimation method, so how would it work here?

I see how KDE returns a density, but how do I tell it to split the data into bins?

How do I get a fixed number of bins, independent of the data (that's one of my requirements)?

More specifically, how would one pull this off using scikit-learn?

My input file looks like:

 str ID    sls
 1          10
 2          11
 3           9
 4          23
 5          21
 6          11
 7          45
 8          20
 9          11
 10         12

I want to group the sls numbers into clusters or bins, such that:

Cluster 1: [10 11 9 11 11 12] 
Cluster 2: [23 21 20] 
Cluster 3: [45] 

And my output file will look like:

 str ID    sls    Cluster ID    Cluster centroid
 1          10    1             10.66
 2          11    1             10.66
 3           9    1             10.66
 4          23    2             21.33
 5          21    2             21.33
 6          11    1             10.66
 7          45    3             45
 8          20    2             21.33
 9          11    1             10.66
 10         12    1             10.66

3 Answers


Write code yourself. Then it fits your problem best!

Boilerplate: Never assume code you download from the net to be correct or optimal... make sure to fully understand it before using it.

%matplotlib inline

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot

a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0, 50)                      # evaluation grid (50 points by default)
e = kde.score_samples(s.reshape(-1, 1))  # log-density at each grid point
plot(s, e)

[plot: the estimated density curve, with a peak near each cluster center]

from scipy.signal import argrelextrema
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print("Minima:", s[mi])
print("Maxima:", s[ma])
> Minima: [ 17.34693878  33.67346939]
> Maxima: [ 10.20408163  21.42857143  44.89795918]

Your clusters therefore are

print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
> [10 11  9 11 11 12] [23 21 20] [45]

and visually, we did this split:

plot(s[:mi[0]+1], e[:mi[0]+1], 'r',
     s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
     s[mi[1]:], e[mi[1]:], 'b',
     s[ma], e[ma], 'go',
     s[mi], e[mi], 'ro')

[plot: the density curve drawn in three colored segments, with minima marked in red and maxima in green]

We cut at the red markers. The green markers are our best estimates for the cluster centers.
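
To produce the output format from the question (a cluster ID and a centroid per row), the minima locations can serve as bin edges. A minimal sketch reusing a, s, and mi from above; np.digitize assigns each value to the interval it falls into, and the names labels and centroids are just illustrative:

labels = np.digitize(a.ravel(), s[mi]) + 1     # cluster IDs starting at 1
centroids = {k: a.ravel()[labels == k].mean()  # per-cluster mean
             for k in np.unique(labels)}
for str_id, (sls, k) in enumerate(zip(a.ravel(), labels), start=1):
    print(str_id, sls, k, round(centroids[k], 2))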

– Has QUIT--Anony-Mousse
  • I would be hesitant to call this method better than k-means. It does involve selecting an arbitrary bandwidth and then calculating 50 density estimates. That being said, I don't know if there is a better way to do it with kernel density estimation. – David Maust Feb 08 '16 at 02:56
  • You don't have to know k. You not only get better centers (less affected by outliers) but also *sound* splitting points (not just halfway between them). There is plenty of literature on bandwidth selection, such as Silverman's rule. Also, who cares about computing 50 density estimates? You could precompute the kernel and do this in a fast convolution. – Has QUIT--Anony-Mousse Feb 08 '16 at 06:29
  • I will also add that this is a particularly fast, non-linear method for 1D clustering. – Matthew Dec 14 '16 at 19:15
  • Hi, I have posted a question about this answer, could you please help me with it? https://stackoverflow.com/questions/60355497/choosing-bandwidthlinspace-for-kernel-density-estimation-why-my-bandwidth-doe – emily.mi Feb 22 '20 at 18:40
  • There is a slight error in this accepted answer (I can't comment yet due to my rank). See my answer below. – Muhammad Yasirroni Feb 20 '21 at 04:40
  • would you mind re-rendering your plots without transparency in the axis? they're hard to see in darkmode – CervEd Aug 06 '21 at 16:43
  • I was getting the error: `ModuleNotFoundError: No module named 'sklearn.neighbors.kde'`. In version 0.24.2 at least, it seems `from sklearn.neighbors.kde import KernelDensity` must be replaced by `from sklearn.neighbors import KernelDensity` – acros Nov 05 '21 at 14:45
  • @HasQUIT--Anony-Mousse I do care about computing 50 density estimates. I have 50,000 tests running in parallel, probably 500,000 times a day, so I do care, and I think your answer is sub-optimal time-wise. – Ash Mar 06 '23 at 20:43
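
On the bandwidth question raised in the comments above: Silverman's rule of thumb gives a data-driven starting value instead of the hard-coded bandwidth=3. A minimal sketch of the rule (an illustration, not what the answer itself used):

import numpy as np

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], dtype=float)
n = a.size
sigma = a.std(ddof=1)                           # sample standard deviation
iqr = np.subtract(*np.percentile(a, [75, 25]))  # interquartile range
bandwidth = 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)
print(bandwidth)  # can then be passed as KernelDensity(bandwidth=...)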

There is a little error in the accepted answer by @Has QUIT--Anony-Mousse (I can't comment or suggest an edit due to my reputation).

The line:

print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])

Should be edited into:

print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

That's because `mi` and `ma` are indices, whereas `s[mi]` and `s[ma]` are the corresponding values. If you use `mi[0]` as the limit, you risk a wrong split whenever your linspace range extends well beyond the range of your data. For example, run this code and see the difference in the split results:

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from scipy.signal import argrelextrema

a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0, 100)  # note: the grid is deliberately much wider than the data
e = kde.score_samples(s.reshape(-1, 1))
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]

print('Grouping by Has QUIT:')
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
print('Grouping by yasirroni:')
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

result:

Grouping by Has QUIT:
[] [10 11  9 11 11 12] [23 21 45 20]
Grouping by yasirroni:
[10 11  9 11 11 12] [23 21 20] [45]
– Muhammad Yasirroni

Further improving the answer above by @yasirroni, to dynamically print all clusters (not just the 3 above), the line:

print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

can be changed into:

print(a[a < s[mi][0]])  # print the leftmost cluster

# print all middle clusters
for i_cluster in range(len(mi)-1):
    print(a[(a >= s[mi][i_cluster]) * (a <= s[mi][i_cluster+1])])

print(a[a >= s[mi][-1]])  # print the rightmost cluster

This would ensure that all the clusters are taken into account.
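
If the whole pipeline is needed repeatedly, the steps above can be wrapped into a single function. A sketch under the same assumptions as the answers above (the function name, default bandwidth, and grid choice are illustrative, not from any answer):

import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

def kde_clusters(values, bandwidth=3.0, grid=None):
    """Split 1D data at the local minima of a Gaussian KDE."""
    a = np.asarray(values, dtype=float).reshape(-1, 1)
    if grid is None:
        # evaluation grid covering the data, with a margin of one bandwidth
        grid = np.linspace(a.min() - bandwidth, a.max() + bandwidth, 200)
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(a)
    e = kde.score_samples(grid.reshape(-1, 1))
    mi = argrelextrema(e, np.less)[0]  # indices of the density minima
    edges = np.concatenate(([-np.inf], grid[mi], [np.inf]))
    return [a[(a >= lo) & (a < hi)] for lo, hi in zip(edges[:-1], edges[1:])]

print(kde_clusters([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]))

Returning the clusters as arrays rather than printing them makes it easier to attach cluster IDs and centroids afterwards.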

– Muhammad Yasirroni