0

I am faced with the following array:

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]

What I would like to do is extract the cluster with the highest scores. That would be

best_cluster = [200,297,275,243]

I have checked quite a few questions on Stack Overflow on this topic and most of them recommend using kmeans, although a few others mention that kmeans might be overkill for clustering 1D arrays. However, kmeans is a supervised learning algorithm, which means I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking at implementing some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one. In array y I would see 3 clusters, like so: [1,2,4,7,9,5,4,7,9], [56,57,54,60], [200,297,275,243]. What algorithm would best fit my needs, considering computation cost and accuracy, and how could I implement it for my problem?

dre_84w934
  • K-means is inherently an unsupervised learning algorithm. Your data are not supplied w/ classes, therefore the k-means clustering algorithm is left to classify the data. This article might provide you some insight into determining the number of clusters: https://pythonprogramminglanguage.com/kmeans-elbow-method/ – rahlf23 Jul 23 '18 at 21:48
  • Possible duplicate of [How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?](https://stackoverflow.com/questions/35094454/how-would-one-use-kernel-density-estimation-as-a-1d-clustering-method-in-scikit) – MoxieBall Jul 23 '18 at 21:52
  • @MoxieBall, it's not the same. What you have there is supervised, there are 3 clusters set up – dre_84w934 Jul 23 '18 at 22:03
  • I don't think the provided one is the best cluster! – Mohd Jul 09 '20 at 20:30

3 Answers

9

Try MeanShift. From the sklearn user guide on MeanShift:

The algorithm automatically sets the number of clusters, ...

Modified demo code:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))

# #############################################################################
# Compute clustering with MeanShift

# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)

ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

print("number of estimated clusters : %d" % n_clusters_)
print(labels)

Output:

number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]

Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.
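
The demo above only prints the labels. To get the cluster with the highest scores itself, one option (not part of the original demo; best_label and best_cluster are just illustrative names) is to pick the label whose cluster center is the largest and select its members. A minimal sketch, reusing X, labels and cluster_centers from the snippet above and relying on the fact that in sklearn's MeanShift label i corresponds to cluster_centers_[i]:

# Label of the cluster whose center is the largest
best_label = int(np.argmax(cluster_centers.ravel()))

# All members of that cluster, in their original order
best_cluster = X.ravel()[labels == best_label]
print(best_cluster)  # [200 297 275 243]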


BTW, as rahlf23 already mentioned, K-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.

EasonL
  • yes, sorry, I'll have to change that, Kmeans is an unsupervised learning algo ;) – dre_84w934 Jul 24 '18 at 06:41
  • so then, would you say MeanShift is computationally more efficient on large data than kmeans? – dre_84w934 Jul 24 '18 at 06:43
  • so I tried it out and it works fine, however one problem I am facing now is that it does not distinguish between negative and positive values. Therefore from example array [-100,-20,-50, 55, 30, 50], it will see the -100 as the best option, which is not correct, it is in fact the lowest. Does this come from the fitting? – dre_84w934 Jul 24 '18 at 13:13
  • @dre_84w934 It is actually the other way around: KMeans with minibatch is scalable, whereas 10K is the recommended upper limit for MeanShift. Clustering algorithms only tell you what the clusters are. You will have to figure out the 'highest' one afterwards. – EasonL Jul 25 '18 at 04:51
  • Right, I've noticed. However, when there are negative values, if the number after the minus sign is bigger than the positive ones, it takes the negative as the higher-order batch – dre_84w934 Jul 25 '18 at 05:35
  • This is by no means unsupervised, as one still needs to provide the quantile value and hence the bandwidth! – Mohd Jul 09 '20 at 20:07
4

Clustering is overkill here

Just compute the differences of consecutive elements, i.e. look at x[i]-x[i-1].

Choose the k largest differences as split points, or define a threshold on when to split, e.g. 20. That depends on your knowledge of the data.

This is O(n), much faster than all the others mentioned. Also very understandable and predictable.

On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
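
For illustration, here is a minimal numpy sketch of this gap-splitting idea (not code from this answer; it sorts first in case the data is not already ordered, assumes you either know k, the number of clusters minus one, or a sensible gap threshold, and the names gaps, split_points and best_cluster are only illustrative):

import numpy as np

y = np.array([1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243])

# Sort so that gaps between clusters show up as large consecutive differences
y_sorted = np.sort(y)
gaps = np.diff(y_sorted)

# Split at the k largest gaps (here k = 2); a fixed gap threshold works the same way
k = 2
split_points = np.sort(np.argsort(gaps)[-k:]) + 1

# Split into clusters and pick the one with the largest mean
clusters = np.split(y_sorted, split_points)
best_cluster = max(clusters, key=np.mean)
print(best_cluster)  # [200 243 275 297] (sorted, original order is lost)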

Has QUIT--Anony-Mousse
  • I don't think clustering is overkill here, because I cannot determine the threshold manually; therefore I have to use some kind of clustering algorithm to determine the biggest values in one group – dre_84w934 Jul 29 '18 at 13:30
  • Do some simple stats on the threshold. Such as the three sigma rule. It's not as if clustering would be parameter free, in the contrary. Choosing parameters for clustering is a big problem! – Has QUIT--Anony-Mousse Jul 29 '18 at 23:22
  • Ok, performance-wise you might be right, clustering might be overkill here. I would argue however that it is parameter free, at least for the purposes I am using it for. With your solution I might have to do some stats, which for high-volume data is not that feasible. Ultimately I am using this algo to filter out poor scores from a fuzzy matching algo. As I have 20-30 scores for each of the 2000 rows, you can see why I prefer something parameter free that simply does the job, even if it might be a bit of overkill. – dre_84w934 Jul 29 '18 at 23:34
  • Clustering is all but parameter free! Just see the many questions on this issue here. Even the most stupid method, k-means, needs very careful data normalization (=many hidden parameters) and of course k. You can just choose the k largest gaps in this approach, too, and you'll likely have better results than with k-means. – Has QUIT--Anony-Mousse Jul 29 '18 at 23:38
  • kmeans is indeed not a parameter free algo, however what i've marked as the correct answer, MeanShift, works just fine. Ain't no parameter needed ;) – dre_84w934 Jul 29 '18 at 23:41
  • and the k largest gaps could work, never heard of it, so I'll have to give it a read – dre_84w934 Jul 29 '18 at 23:43
  • Meanshift has many more parameters. Such as the Kernel to use, the radius, ... You are quite fortunate if the defaults work for you. K largest gaps is what I suggested here! – Has QUIT--Anony-Mousse Jul 30 '18 at 05:37
  • @dre_84w934 You still need the quantile to get the bandwidth. Using quantile=None to estimate the bandwidth produces a mess. I have been stuck with this algorithm for 5 days now. I have designed algorithms, funnily enough, to estimate the quantiles based on the standard deviation of my data, which works well, but not well enough since I am aiming for performance. Now I am looking for an alternative to MeanShift. – Mohd Jul 09 '20 at 20:14
-2

HDBSCAN is the best clustering algorithm and you should always use it.

Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.

For min_cluster_size I suggest using 3, since a cluster of 2 is lame, and for metric the default euclidean works great, so you don't even need to mention it.

Don't forget that distance metrics apply to vectors and here we have scalars so some ugly reshaping is in order.

To put it all together, and assuming that by "cluster with the highest scores" you mean the cluster that includes the max value, we get:

from hdbscan import HDBSCAN
import numpy as np

y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
# HDBSCAN expects 2D input, so turn the scalars into single-feature rows
y = np.reshape(y, (-1, 1))

clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)

# Look up the label of the max value and take that cluster's exemplar points
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)

The output is [297 200 275 243]. Original order is not preserved. C'est la vie.
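
If you do want the members in their original order, a small variation (just a sketch, not part of the original code; it reuses y and cluster_labels from above and assumes the max value was not labelled as noise, i.e. -1) is to select by label instead of going through exemplars_:

# Label of the cluster containing the maximum value
best_label = cluster_labels[y.argmax()]

# All points sharing that label, in their original order
best_cluster = y.ravel()[cluster_labels == best_label]
print(best_cluster)  # [200 297 275 243]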