
In a document clustering process, as a data pre-processing step, I first applied singular value decomposition (SVD) to obtain U, S and Vt, and then, by choosing a suitable number of singular values, I truncated Vt, which now gives me good document-document correlations from what I read here. I am now clustering the columns of the matrix Vt to group similar documents together, and for this I chose k-means. The initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the value of k itself. To determine the number of clusters k in k-means, I was pointed to cross-validation.

Before implementing it, I wanted to figure out whether there is a built-in way to achieve this using numpy or scipy. Currently, I am performing k-means by simply using the function from scipy:

import numpy
from numpy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2

# Preprocess the data and compute the SVD
U, S, Vt = svd(A)  # A is the TF-IDF representation of the original term-document matrix

# Obtain the document-document correlations from Vt
# The 50 is the threshold obtained after examining a scree plot of S
docvectors = numpy.transpose(Vt[0:50, :])

# Prepare the data to run k-means
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)

Assuming my methodology is correct so far (please correct me if I am missing a step): at this stage, what is the standard way of using this output to perform cross-validation? Any references, implementations, or suggestions on how this would apply to k-means would be greatly appreciated.

Legend

3 Answers


To run k-fold cross-validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.

Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference from classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.

V-measure and several other scores are implemented in scikit-learn, as well as generic cross-validation code and a "grid search" module that optimizes according to a specified evaluation measure using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
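As a concrete sketch of what that evaluation loop could look like (my own illustration, not code from scikit-learn's CV or grid-search modules; docvectors comes from the question's pipeline, while labeled_idx and true_labels are an assumed hand-labeled subset):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def score_k(docvectors, labeled_idx, true_labels, k):
    # k-means places the centroids using *all* documents...
    pred = KMeans(n_clusters=k, n_init=10).fit_predict(docvectors)
    # ...but the V-measure is computed on the labeled subset only.
    return v_measure_score(true_labels, pred[labeled_idx])

# Pick the k that scores best on the labeled subset:
# best_k = max(range(2, 21), key=lambda k: score_k(docvectors, labeled_idx, true_labels, k))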

Fred Foo
  • Thank you for your reply. I am sorry, but I am a little confused at this point. Should my takeaway be that I cannot use k-fold cross-validation on the output of k-means (because the output of k-means is just what I specify it to be: the data classified into the clusters that I specify), and that I need to manually label some of my dataset? I guess my question is more along the lines of: how would I utilize this technique to determine k in k-means, or do I follow some additional steps? – Legend Jul 10 '11 at 01:38
  • @Legend: well, cross validation can be very useful for finding the right parameters, but since it requires some data to be labeled and changing *k* actually changes **the number of labels**, it may not be the best technique for optimizing this specific parameter. I can't help you much further; I don't do clustering on a daily basis. – Fred Foo Jul 10 '11 at 14:07
  • Great! Thank you for clarifying that. I was suggested this technique from another question of mine. I will look for some other approaches. Thank you once again for your time. – Legend Jul 11 '11 at 02:38

Indeed, to do traditional cross-validation with F1-score or V-measure as the scoring function, you would need some labeled data as ground truth. But in that case you could just count the number of classes in the ground-truth dataset and use that as your optimal value for k, hence no need for cross-validation.

Alternatively, you could use a cluster stability measure as an unsupervised performance evaluation and do some kind of cross-validation procedure with it. However, this is not yet implemented in scikit-learn, though it is on my personal todo list.

You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular, you should read *Clustering Stability: An Overview* by Ulrike von Luxburg.
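A rough sketch of one such stability check (the subsampling scheme and helper below are my own illustration in the spirit of von Luxburg's overview, not an existing scikit-learn routine; KMeans and adjusted_rand_score are real scikit-learn APIs):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_rounds=10, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = []
    for _ in range(n_rounds):
        # Cluster two random subsamples of the documents.
        a = rng.choice(n, int(frac * n), replace=False)
        b = rng.choice(n, int(frac * n), replace=False)
        la = KMeans(n_clusters=k, n_init=10).fit_predict(X[a])
        lb = KMeans(n_clusters=k, n_init=10).fit_predict(X[b])
        # Compare the two labelings on the documents they share;
        # the adjusted Rand index ignores how clusters are numbered.
        common, ia, ib = np.intersect1d(a, b, return_indices=True)
        scores.append(adjusted_rand_score(la[ia], lb[ib]))
    return np.mean(scores)  # higher means more stable clusterings for this k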

ogrisel

Here they use withinss to find an optimal number of clusters. withinss is a component of the object that R's kmeans returns, and it can be used to look for a minimum "error" (the elbow in the within-groups sum of squares).

https://www.statmethods.net/advstats/cluster.html

# wss[1] is the total sum of squares, i.e. the within-groups SS for a single cluster
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

This formula isn't exactly what I need, but I'm working on a variant myself. The model would still change on every run, but it would at least be the best model out of a bunch of iterations.
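For completeness, the same elbow idea translated to the Python stack used in the question (a sketch, assuming scikit-learn and matplotlib are installed; inertia_ is KMeans's within-cluster sum of squares, the counterpart of R's withinss, and docvectors is reused from the question's code):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(2, 16)
wss = [KMeans(n_clusters=k, n_init=10).fit(docvectors).inertia_ for k in ks]
plt.plot(ks, wss, "b-o")
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster sum of squares")
plt.show()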

thistleknot