You would use what is called probabilistic k-means: run k-means several times on the same input data points, and on each run keep track of which cluster each point ended up in. As the number of runs increases, you can estimate the probability of a point belonging to a given cluster more reliably. So for some data point Pi you can find its probability of belonging to cluster Ci by doing: P(Pi in Ci) = (# of runs Pi ended up in Ci)/(total # of runs). This is computed for each cluster, and the cluster with the highest probability is your prediction for data point Pi. Or you could simply keep a count of how many times the point landed in each cluster and pick the highest count.
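As a rough sketch of the idea (a minimal 1-D implementation; the function names, the jitter size, and the fixed iteration cap are all my own choices, not from any particular library):

```
// Minimal 1-D k-means (Lloyd's algorithm). Returns per-point labels and centroids.
function kMeans(points, k, initialCentroids) {
  let centroids = initialCentroids
    ? initialCentroids.slice()
    : Array.from({ length: k }, () =>
        points[Math.floor(Math.random() * points.length)]); // random seeding
  let labels = new Array(points.length).fill(0);
  for (let iter = 0; iter < 100; iter++) {
    // Assignment step: each point goes to its nearest centroid.
    labels = points.map(p => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
      }
      return best;
    });
    // Update step: move each centroid to the mean of its assigned points.
    const sums = new Array(k).fill(0), counts = new Array(k).fill(0);
    points.forEach((p, i) => { sums[labels[i]] += p; counts[labels[i]]++; });
    const next = centroids.map((c, j) => (counts[j] ? sums[j] / counts[j] : c));
    if (next.every((c, j) => c === centroids[j])) break; // converged
    centroids = next;
  }
  return { labels, centroids };
}

// Probabilistic k-means: run several times, count cluster votes per point.
function probabilisticKMeans(points, k, numRuns) {
  const votes = points.map(() => new Array(k).fill(0));
  let centroids = null;
  for (let run = 0; run < numRuns; run++) {
    // Seed each run near the previous run's centroids (small jitter) so the
    // cluster labels stay consistent from run to run -- see the note below.
    const seeds = centroids && centroids.map(c => c + (Math.random() - 0.5) * 0.1);
    const result = kMeans(points, k, seeds);
    centroids = result.centroids;
    result.labels.forEach((label, i) => votes[i][label]++);
  }
  // P(Pi in Ci) = (# of runs Pi landed in Ci) / (total # of runs).
  return votes.map(v => {
    const cluster = v.indexOf(Math.max(...v));
    return { cluster, probability: v[cluster] / numRuns };
  });
}
```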
Note: ensure that the labeling of the clusters stays consistent across executions of k-means. One way to do this is to use the centroids of the clusters from iteration i as seeds for the clusters generated in iteration i+1 (the new centroids then have a slightly higher probability of ending up near the previous ones). Alternatively, each iteration can be seeded completely at random, in which case you need to match the new clusters back to the previous labels afterwards.
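If you do go the fully randomized route, you can recover consistent labels by matching each run's centroids against a reference run's centroids, roughly like this (a greedy nearest-centroid matching sketch, again for 1-D data with made-up names):

```
// Relabel a run's clusters so cluster j lines up with the nearest
// still-unclaimed centroid from a reference run. Greedy matching is
// fine as a sketch for small k.
function matchLabels(refCentroids, newCentroids) {
  const mapping = new Array(newCentroids.length);
  const taken = new Set();
  newCentroids.forEach((c, j) => {
    let best = -1, bestDist = Infinity;
    refCentroids.forEach((r, m) => {
      const d = Math.abs(c - r);
      if (!taken.has(m) && d < bestDist) { bestDist = d; best = m; }
    });
    mapping[j] = best;
    taken.add(best);
  });
  return mapping; // mapping[j] = consistent label for the run's cluster j
}
```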
Probabilistic k-means is used because a single run of standard k-means can produce a poor clustering approximation. The major drawback is that it trades heavy computation for that accuracy: it is really only practical on 1-D or 2-D data and may be too slow for real-world datasets. A related approach that tackles the same problem more cheaply is k-means++.
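For reference, the k-means++ seeding step looks roughly like this (a sketch of the standard D-squared sampling, assuming 1-D data):

```
// k-means++ seeding: pick each new seed with probability proportional
// to the squared distance from the nearest seed chosen so far.
function kMeansPlusPlusSeeds(points, k) {
  // First seed: a uniformly random point.
  const seeds = [points[Math.floor(Math.random() * points.length)]];
  while (seeds.length < k) {
    // Squared distance from each point to its nearest existing seed.
    const d2 = points.map(p => Math.min(...seeds.map(s => (p - s) * (p - s))));
    const total = d2.reduce((a, b) => a + b, 0);
    // Sample the next seed index with probability proportional to d2.
    let r = Math.random() * total;
    let i = 0;
    while (i < points.length - 1 && r > d2[i]) { r -= d2[i]; i++; }
    seeds.push(points[i]);
  }
  return seeds;
}
```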
A simpler and more common approach is to perform k-means once and, for a new data point Pi, take the distance between that point and each cluster centroid and pick the smallest one. That cluster is the prediction for the point. This method is much faster but can lead to inaccuracies and poor approximations, especially if the point is an outlier or the distances to the two nearest centroids are nearly equal.
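In code, that prediction step is just a nearest-centroid lookup (1-D again, name is my own):

```
// Predict the cluster of a new point: the nearest centroid wins.
function predictCluster(point, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (Math.abs(point - centroids[c]) < Math.abs(point - centroids[best])) best = c;
  }
  return best;
}
```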
If you would like an example, I implemented probabilistic k-means on a 1-D dataset of grey-scale color data for an old class assignment. The same thing can be done for your dataset (just note that the higher the dimension of the data, the slower this will run). That being said, it was done a while back, so the code is a little messy; the important part starts at: for(var q=0;q<numOfComputes;q++).