
I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name. The current method used by the system I'm on is K-means, but that seems like overkill.

Is there a better way of performing this task?

Answers to some other posts mention KDE (Kernel Density Estimation), but that is a density estimation method, so how would it work here?

I see how KDE returns a density, but how do I tell it to split the data into bins?

How do I get a fixed number of bins, independent of the data (that's one of my requirements)?

More specifically, how would one pull this off using scikit-learn?

My input file looks like:

 str ID    sls
 1          10
 2          11
 3           9
 4          23
 5          21
 6          11
 7          45
 8          20
 9          11
 10         12

I want to group the sls numbers into clusters or bins, such that:

Cluster 1: [10 11 9 11 11 12] 
Cluster 2: [23 21 20] 
Cluster 3: [45] 

And my output file will look like:

 str ID    sls    Cluster ID    Cluster centroid
 1          10    1             10.66
 2          11    1             10.66
 3           9    1             10.66
 4          23    2             21.33
 5          21    2             21.33
 6          11    1             10.66
 7          45    3             45
 8          20    2             21.33
 9          11    1             10.66
 10         12    1             10.66

3 Answers


Write code yourself. Then it fits your problem best!

Boilerplate: Never assume code you download from the net to be correct or optimal... make sure to fully understand it before using it.

%matplotlib inline

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot

a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0, 50)                      # evaluation grid (50 points by default)
e = kde.score_samples(s.reshape(-1, 1))  # log-density at each grid point
plot(s, e)

[plot: the estimated density curve, with a peak near each cluster center]

from scipy.signal import argrelextrema
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print("Minima:", s[mi])
print("Maxima:", s[ma])
> Minima: [ 17.34693878  33.67346939]
> Maxima: [ 10.20408163  21.42857143  44.89795918]

Your clusters therefore are

print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
> [10 11  9 11 11 12] [23 21 20] [45]

and visually, we did this split:

plot(s[:mi[0]+1], e[:mi[0]+1], 'r',
     s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
     s[mi[1]:], e[mi[1]:], 'b',
     s[ma], e[ma], 'go',
     s[mi], e[mi], 'ro')

[plot: the density curve drawn in three colored segments, with minima marked in red and maxima in green]

We cut at the red markers. The green markers are our best estimates for the cluster centers.
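
To produce the output format from the question (a cluster ID and a centroid per row), the minima locations can serve as bin edges. A minimal sketch reusing a, s, and mi from above; np.digitize assigns each value to the interval it falls into, and the names labels and centroids are just illustrative:

labels = np.digitize(a.ravel(), s[mi]) + 1     # cluster IDs starting at 1
centroids = {k: a.ravel()[labels == k].mean()  # per-cluster mean
             for k in np.unique(labels)}
for str_id, (sls, k) in enumerate(zip(a.ravel(), labels), start=1):
    print(str_id, sls, k, round(centroids[k], 2))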

– Has QUIT--Anony-Mousse
  • I would be hesitant to call this method better than k-means. It does involve selecting an arbitrary bandwidth and then calculating 50 density estimates. That being said, I don't know if there is a better way to do it with kernel density estimation. – David Maust Feb 08 '16 at 02:56
  • You don't have to know k. You not only get better centers (less affected by outliers) but also *sound* splitting points (not just halfway between them). There is plenty of literature on bandwidth selection, such as Silverman's rule. Also, who cares about computing 50 density estimates? You could precompute the kernel and do this in a fast convolution. – Has QUIT--Anony-Mousse Feb 08 '16 at 06:29
  • I will also add that this is a particularly fast, non-linear method for 1D clustering. – Matthew Dec 14 '16 at 19:15
  • Hi, I have posted a question about this answer, could you please help me with it? https://stackoverflow.com/questions/60355497/choosing-bandwidthlinspace-for-kernel-density-estimation-why-my-bandwidth-doe – emily.mi Feb 22 '20 at 18:40
  • There is a slight error in this accepted answer (I can't comment yet due to my rank). See my answer below. – Muhammad Yasirroni Feb 20 '21 at 04:40
  • would you mind re-rendering your plots without transparency in the axis? they're hard to see in darkmode – CervEd Aug 06 '21 at 16:43
  • I was getting the error: `ModuleNotFoundError: No module named 'sklearn.neighbors.kde'`. In version 0.24.2 at least, it seems `from sklearn.neighbors.kde import KernelDensity` must be replaced by `from sklearn.neighbors import KernelDensity` – acros Nov 05 '21 at 14:45
  • @HasQUIT--Anony-Mousse I do care about computing 50 density estimates. I have 50,000 tests running in parallel, probably 500,000 times a day, so I do care, and I think your answer is sub-optimal time-wise. – Ash Mar 06 '23 at 20:43
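
On the bandwidth question raised in the comments above: Silverman's rule of thumb gives a data-driven starting value instead of the hard-coded bandwidth=3. A minimal sketch of the rule (an illustration, not what the answer itself used):

import numpy as np

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], dtype=float)
n = a.size
sigma = a.std(ddof=1)                           # sample standard deviation
iqr = np.subtract(*np.percentile(a, [75, 25]))  # interquartile range
bandwidth = 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)
print(bandwidth)  # can then be passed as KernelDensity(bandwidth=...)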

There is a little error in the accepted answer by @Has QUIT--Anony-Mousse (I can't comment or suggest an edit due to my reputation).

The line:

print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])

Should be edited into:

print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

That's because `mi` and `ma` are indices, whereas `s[mi]` and `s[ma]` are the corresponding values. If you use `mi[0]` as the limit, you risk a wrong split whenever your linspace range extends well beyond the range of your data. For example, run this code and see the difference in the split results:

import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from scipy.signal import argrelextrema

a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0, 100)  # note: the grid is deliberately much wider than the data
e = kde.score_samples(s.reshape(-1, 1))
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]

print('Grouping by Has QUIT:')
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
print('Grouping by yasirroni:')
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

result:

Grouping by Has QUIT:
[] [10 11  9 11 11 12] [23 21 45 20]
Grouping by yasirroni:
[10 11  9 11 11 12] [23 21 20] [45]
– Muhammad Yasirroni

Further improving the answer above by @yasirroni, to dynamically print all clusters (not just the 3 above), the line:

print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])

can be changed into:

print(a[a < s[mi][0]])  # print the leftmost cluster

# print all middle clusters
for i_cluster in range(len(mi)-1):
    print(a[(a >= s[mi][i_cluster]) * (a <= s[mi][i_cluster+1])])

print(a[a >= s[mi][-1]])  # print the rightmost cluster

This would ensure that all the clusters are taken into account.
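
If the whole pipeline is needed repeatedly, the steps above can be wrapped into a single function. A sketch under the same assumptions as the answers above (the function name, default bandwidth, and grid choice are illustrative, not from any answer):

import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

def kde_clusters(values, bandwidth=3.0, grid=None):
    """Split 1D data at the local minima of a Gaussian KDE."""
    a = np.asarray(values, dtype=float).reshape(-1, 1)
    if grid is None:
        # evaluation grid covering the data, with a margin of one bandwidth
        grid = np.linspace(a.min() - bandwidth, a.max() + bandwidth, 200)
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(a)
    e = kde.score_samples(grid.reshape(-1, 1))
    mi = argrelextrema(e, np.less)[0]  # indices of the density minima
    edges = np.concatenate(([-np.inf], grid[mi], [np.inf]))
    return [a[(a >= lo) & (a < hi)] for lo, hi in zip(edges[:-1], edges[1:])]

print(kde_clusters([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]))

Returning the clusters as arrays rather than printing them makes it easier to attach cluster IDs and centroids afterwards.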

– Muhammad Yasirroni