
I am using the K-means algorithm for creating k clusters out of 2-dimensional data.

I intend to use the clusters to predict which cluster an incoming data point belongs to. How should I use the k-means algorithm for prediction?

Note: I am using the k-means JS implementation from here

stackoverflowN
  • Looks like you want to use a probabilistic k-means for cluster prediction. – Spencer Wieczorek Oct 09 '15 at 19:41
  • Possible duplicate of [Simple approach to assigning clusters for new data after k-means clustering](http://stackoverflow.com/questions/20621250/simple-approach-to-assigning-clusters-for-new-data-after-k-means-clustering) – Has QUIT--Anony-Mousse Oct 10 '15 at 10:46

2 Answers


Assign each new object to the nearest cluster center, too. That's all.

k-means finds a Voronoi cell partitioning of your data. The only cluster assignment consistent with this model (unless you want to e.g. update the model based on the new data, which may cause relabeling of old points) is to assign each point to the Voronoi cell it is located in. That is easily done using the above rule.
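As an illustration, here is a minimal sketch of that rule in JavaScript; the function names and the assumption that centroids are `[x, y]` pairs are mine for the example, not part of the linked library:

```javascript
// Squared Euclidean distance between two 2-D points [x, y].
function squaredDistance(a, b) {
  var dx = a[0] - b[0];
  var dy = a[1] - b[1];
  return dx * dx + dy * dy;
}

// Return the index of the centroid closest to the new point,
// i.e. the Voronoi cell the point falls into.
function predictCluster(point, centroids) {
  var best = 0;
  var bestDist = squaredDistance(point, centroids[0]);
  for (var i = 1; i < centroids.length; i++) {
    var d = squaredDistance(point, centroids[i]);
    if (d < bestDist) {
      bestDist = d;
      best = i;
    }
  }
  return best;
}

// Example: assign an incoming point to one of three existing centroids.
var centroids = [[1, 1], [5, 5], [9, 1]];
console.log(predictCluster([4.8, 5.2], centroids)); // -> 1
```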

Note that clustering is not classification. Few clustering algorithms will allow you to apply their model to classify new instances; they were not meant to be used this way! The purpose of clustering is to better understand your data. The workflow is to cluster, study the result, and then maybe build something new or different from what you have learned. It is usually not helpful to be able to classify new objects as "should go to cluster 3", because this assumes that A) the clusters are meaningful/useful (often they are not) and B) they are clean (often, some objects are not in the cluster a human would have assigned them to).

This question has been asked several times before (use search!):

Has QUIT--Anony-Mousse
  • I'm assuming that you mean the *nearest* of the cluster centroids? That will give a decent but not entirely accurate prediction, as outliers have a good chance of going to the wrong cluster (*basically it's bound by the cluster generation error*). – Spencer Wieczorek Oct 10 '15 at 00:40
  • k-means builds a Voronoi partition of the data set. I don't try to fix limitations of k-means here. The appropriate Voronoi cell is easy to identify this way. – Has QUIT--Anony-Mousse Oct 10 '15 at 09:55

You could use what is called a probabilistic k-means approach, which means running k-means several times on the same input data points. On each run, keep track of which point ended up in which cluster. As the number of runs through k-means increases, you can better estimate the probability of a point belonging to a certain cluster. So for some data point Pi you can find its probability of belonging to some cluster Ci by doing: P(Pi in Ci) = (# of runs where Pi ended up in Ci) / (total # of runs). This would be done for each cluster, and the cluster with the highest probability would be your prediction for data point Pi. Or you could simply keep count of how many times a point landed in each cluster and pick the highest count.

Note: Ensure that the labeling for each cluster is consistent through each execution of k-means. This can be done by using the centroids of the clusters from iteration i as a seed for the clusters generated in iteration i+1 (so the new initialization has a slightly increased probability of being close to the previous centroids), or the new iteration can be completely randomized.
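A rough sketch of this counting scheme in JavaScript, assuming a hypothetical `kmeans(points, k, initialCentroids)` function that returns `{ centroids, assignments }` (with `assignments[i]` being the cluster index of point `i`), and assuming the cluster labels stay in correspondence across runs as described in the note above:

```javascript
// counts[i][c] = number of runs in which point i ended up in cluster c.
function probabilisticAssignments(points, k, numOfRuns, kmeans) {
  var counts = points.map(function () {
    return new Array(k).fill(0);
  });

  var centroids = null; // first run uses whatever initialization kmeans() does
  for (var q = 0; q < numOfRuns; q++) {
    var result = kmeans(points, k, centroids);
    result.assignments.forEach(function (cluster, i) {
      counts[i][cluster] += 1;
    });
    centroids = result.centroids; // seed the next run with the previous centroids
  }

  // counts[i][c] / numOfRuns estimates P(point i in cluster c);
  // for each point, pick the cluster it landed in most often.
  return counts.map(function (row) {
    return row.indexOf(Math.max.apply(null, row));
  });
}
```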

Probabilistic k-means is used because a single run of standard k-means can give a poor clustering approximation. A major problem is that this is computationally heavy for the sake of accuracy, so it only really works on 1-D or 2-D data and may be too heavy on real-world data. A related approach that tries to deal with this problem is k-means++.
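For reference, k-means++ changes only the seeding step: initial centroids are chosen so that points far from the already-picked centroids are more likely to be selected next. A minimal sketch of that seeding, assuming 2-D points as in the question (the helper names are made up for the example):

```javascript
// k-means++ seeding: sample each new centroid with probability
// proportional to its squared distance from the nearest chosen centroid.
function kmeansPlusPlusInit(points, k) {
  function sqDist(a, b) {
    var dx = a[0] - b[0], dy = a[1] - b[1];
    return dx * dx + dy * dy;
  }

  // First centroid is chosen uniformly at random.
  var centroids = [points[Math.floor(Math.random() * points.length)]];

  while (centroids.length < k) {
    // Squared distance from each point to its nearest chosen centroid.
    var d2 = points.map(function (p) {
      return Math.min.apply(null, centroids.map(function (c) {
        return sqDist(p, c);
      }));
    });

    // Sample the next centroid with probability proportional to d2.
    var r = Math.random() * d2.reduce(function (a, b) { return a + b; }, 0);
    var i = 0;
    while (i < d2.length - 1 && r > d2[i]) {
      r -= d2[i];
      i++;
    }
    centroids.push(points[i]);
  }
  return centroids;
}
```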


A simpler and more common approach is to perform k-means once and, for a new data point Pi, simply compute the distance between that point and all of the cluster centroids and pick the lowest one. That cluster would be used as the prediction for that point. This method is much faster but can lead to inaccuracies and poor approximations from the generated clusters, especially if the point is an outlier and its distances to the two nearest centroids are close.


If you would like an example, I have implemented a probabilistic k-means on a 1-D data set of grey-scale color data for an old class assignment. The same thing can be done for your dataset (just note that the higher the dimension of the data set, the slower this will run). That being said, it was done a while back (basically the code is a little messy); the important part of the code starts at: for(var q=0;q<numOfComputes;q++).

Spencer Wieczorek
  • You do not have "consistent" labels in k-means (except for your 1d toy case, where you do not need k-means). This approach does not work. – Has QUIT--Anony-Mousse Oct 10 '15 at 09:57
  • @Anony-Mousse It's computationally heavy, but the approach does work and is used to give more accurate results. It's not hard to place labels in k-means, that's how the algorithm works. Don't think the approach doesn't work purely because you don't know it. – Spencer Wieczorek Oct 10 '15 at 14:51
  • I know k-means, and the results tend to be all over the place, in particular on real data. Define "ensure consistent labeling" in your approach, please; and what to do if the clusters vary heavily... – Has QUIT--Anony-Mousse Oct 10 '15 at 14:57
  • @Anony-Mousse I've updated my question, the purpose of this algorithm is because results can be all over the place. – Spencer Wieczorek Oct 10 '15 at 15:47
  • But because of that, `Ci` is meaningless. If you recycle the means for another run, you will get the same result. Your "Note" does not make sense. If the means have converged, they will yield the exact same clustering on the next run; then the approach will not work either. – Has QUIT--Anony-Mousse Oct 10 '15 at 19:09
  • @Anony-Mousse No they won't, and no, the results will not be the same; they will improve with each run. Again, just because you don't know about the approach doesn't imply it doesn't work. Please do some basic research on k-means and specific approaches; it does work, and your lack of knowledge on the subject doesn't mean it won't. – Spencer Wieczorek Oct 10 '15 at 19:58
  • You are referring to *iterations* then, not complete (*converged*) k-means. Please do some basic research on k-means and terminology. Using the first iterations is not a good idea either. The 1d/2d claim is also nonsense. Also, it is not at all computationally heavy (it is linear), and k-means++ solves a different problem (converging faster via non-uniform random initialization). – Has QUIT--Anony-Mousse Oct 10 '15 at 20:28
  • @Anony-Mousse Each iteration is an entire run of k-means through the data-set; it's computationally heavy since there are several runs of k-means. It's a probabilistic k-means in the same sense that pKNN is probabilistic. – Spencer Wieczorek Oct 10 '15 at 20:48
  • @Anony-Mousse Let me find some papers over *probabilistic k-means* so you can reference them. – Spencer Wieczorek Oct 10 '15 at 20:56
  • If it is an entire run, the centroids have converged and it does not make sense to rerun k-means with them. Even multiple runs of k-means is cheap, and O(n). Oh, and "kNN" is the abbreviation of "k nearest neighbors", which is something very different than k-means. – Has QUIT--Anony-Mousse Oct 10 '15 at 21:46
  • @Anony-Mousse They use the same probabilistic method, I know they are different algorithms. K-means generally has errors in cluster computation, running it several times gives a better estimate. – Spencer Wieczorek Oct 10 '15 at 21:50
  • Estimate of *what*? Not of cluster assignment, because every run will give you different clusters, and there is no 1:1 correspondence across multiple runs of k-means. – Has QUIT--Anony-Mousse Oct 10 '15 at 21:54
  • @Anony-Mousse That's why the centroids of the last clustering are used as a seed for the new clusters; the new clusters are based on the centroids of the old ones. I've already said this in my answer. – Spencer Wieczorek Oct 10 '15 at 21:56
  • But that clustering had *converged*. By the definition of convergence, you don't get anything different by repeating the procedure. Gosh, do the math. Running k-means will finish after 1 iteration and return the exact same result (it is still a local minimum, even if you try again). – Has QUIT--Anony-Mousse Oct 10 '15 at 21:58
  • @Anony-Mousse k-means has different results for different initial centroids. – Spencer Wieczorek Oct 10 '15 at 22:01
  • It *can* have different results, that is why you usually try multiple *random* initializations. But if you put in the *final* centroids, they will not change anymore (and any interim result will converge to the same final centroids). Unless you have a bug in your code. **Try it.** Run k-means in R, and use the centroids as initial centers for another run. (Or do the math. You are really on the wrong track. If it has converged, it has converged. Except for initialization, k-means is deterministic.) – Has QUIT--Anony-Mousse Oct 10 '15 at 22:06
  • @Anony-Mousse Do you mean if you cluster the data, and run k-means on the already clustered data? – Spencer Wieczorek Oct 10 '15 at 22:12
  • No, the input data, as you wrote. – Has QUIT--Anony-Mousse Oct 10 '15 at 22:14
  • @Anony-Mousse Maybe I'm not being clear: the new initializations are nearly random, but have a slight influence (seed) from the last centroids. It's used so you can get a better local minimum. – Spencer Wieczorek Oct 10 '15 at 22:22
  • Define how that "nearly random", "slight influence" and "seed" is realized in a way that you get different results *but* retain a 1:1 correspondence. As you wrote it above, it is 100% the previous centroids - so yes, you are not clear. – Has QUIT--Anony-Mousse Oct 10 '15 at 22:27
  • @Anony-Mousse It really depends on the user's implementation. But basically the new initialization has a slightly better chance of being closer to the last (or the several previous) centroids than of being far away. – Spencer Wieczorek Oct 10 '15 at 22:32
  • How does the initialization work? Why does it have a higher chance? What happens if it doesn't? If you do multiple runs, isn't there a high chance it will drift into something very different? Sorry, but I doubt this will work. In particular, since it **lacks the crucial detail** of how to do it. – Has QUIT--Anony-Mousse Oct 10 '15 at 22:46
  • @Anony-Mousse After several iterations you can take the average or "best" cluster. The first initialization is random; the rest essentially attempt to pick better initializations for better clusters. The basic way to do it is simply random initialization each time; I could simply say that instead, as it would be a simpler way to do it. – Spencer Wieczorek Oct 10 '15 at 22:50
  • No. The averages will tend to the data set center. And I already explained that random initialization doesn't work because you need a 1:1 correspondence of clusters for your overall goal. Remember that the question wasn't how to improve k-means clustering quality, but how to classify new instance to the existing clusters. With random initialization cluster 1 of one run may be cluster 2 of another, etc. – Has QUIT--Anony-Mousse Oct 10 '15 at 22:59
  • You have a point about the existing clusters in the question; I guess maybe I'm trying to answer the wrong question. – Spencer Wieczorek Oct 10 '15 at 23:32