
I have a large data set containing descriptions of 81432 images. The descriptions are produced by an image descriptor that generates a 127-element vector for each image, so I have a matrix with 81432 rows and 127 columns.

I'm running kmeans in R, but I don't know how to interpret the results. I set a number of clusters and the algorithm runs, but then what? I'd like to plot the elbow rule, but I don't know how to do that either.

Victor Leal
  • Please read [how to create a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Include some sample data and describe exactly what it is you want your plot to look like. If you're just looking for visualization recommendations, then that really isn't a programming question and may be a better fit for [stats.se] rather than Stack Overflow. – MrFlick Oct 13 '15 at 14:43
  • Thanks @MrFlick for the explanation. Actually, I really don't know what kind of visualization I'm looking for (maybe something like a scatter plot). I've put this question on Cross Validated too. – Victor Leal Oct 13 '15 at 16:26

2 Answers


An example code snippet that uses k-means and principal component analysis (PCA) to analyze and visualize a data set:

# Only the packages actually used below are loaded
library(rpanel)         # rp.plot3d (interactive 3D plots, built on rgl)
library(rgl)
library(scatterplot3d)  # scatterplot3d
library(cluster)        # clusplot



#Read data
mydata <- read.table(file="c:/data.mtx", header=TRUE, row.names=1, sep="");

# Let's look at the (absolute) correlations
mydata.cor <- abs(cor(scale(mydata)))
mydata.cor[,1:2]

# Let's look at the data in an interactive 3D plot before PCA
rp.plot3d(mydata[,1],mydata[,2], mydata[,3])

# Doing the PCA (the prcomp argument is named `scale.`, with a trailing dot)
mydata.pca <- prcomp(mydata, retx=TRUE, center=TRUE, scale.=TRUE)
summary(mydata.pca)
# 3D plot of first three PCs
rp.plot3d(mydata.pca$x[,1],mydata.pca$x[,2],mydata.pca$x[,3])


#Eigenvalues of components for Kaiser Criterion
mydata.pca$sdev ^2


#scree test for determining optimal number of PCs (Elbow rule)
par(mfrow=c(1,2))
screeplot(mydata.pca,main="Scree Plot",xlab="Components")
screeplot(mydata.pca,type="line",main="Scree Plot")

#Scores
scores <- mydata.pca$x
##  Plot of the scores, with the axes
pdf("scores.pdf")
plot(scores[,1], scores[,2], xlab="Scores 1", ylab="Scores 2")
text(x=scores[,1], y=scores[,2], labels=row.names(scores), cex=0.4, col="blue")
abline(h=0, v=0, lty=2)  ##  Draw dashed horizontal and vertical axes through the origin
dev.off()

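#finding possible number of clusters in Kmeans (elbow plot)
mydata.scaled <- scale(mydata)  # scale once instead of on every iteration
# for one cluster, the within groups sum of squares equals the total sum of squares
wss <- (nrow(mydata.scaled)-1)*sum(apply(mydata.scaled,2,var))
for (i in 2:20) wss[i] <- kmeans(mydata.scaled, centers=i)$tot.withinss
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")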

#Performing K-Means on the first two PCs and visualizing the result
km1 <- kmeans(scores[,1:2], algorithm = "Hartigan-Wong", centers=4)
#par(mfrow = c(1, 1))
pdf("km.pdf")
plot(scores[,1:2], col = km1$cluster)
points(km1$centers, col = 1:4, pch = 8, cex=2)  # one colour per centre (4 clusters)
# NOTE: km1$centers has only two columns (PC1, PC2), so this is not a true 3D view of the centres
scatterplot3d(km1$centers, pch=20, highlight.3d = TRUE, type="h")
# getting cluster means
aggregate(scores[,1:2], by=list(km1$cluster), FUN=mean)
# appending cluster assignment
clustercounts <- data.frame(scores[,1:2], km1$cluster)
#Cluster Plot against 1st 2 principal components
clusplot(scores[,1:2], km1$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, cex=0.2)
dev.off()
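
Since `c:/data.mtx` is not available to readers, here is a minimal reproducible sketch of the same PCA-then-k-means idea on simulated data. The matrix `sim`, its size (300 rows, 10 columns) and the choice of 3 clusters are assumptions made purely for illustration; they are not the asker's data or settings. It also shows which parts of the `kmeans` result are usually worth inspecting.

```r
set.seed(1)  # k-means uses random starting centres; fix the seed for repeatability

# Hypothetical stand-in for the real descriptor matrix: 3 groups of 100 rows, 10 columns
sim <- rbind(
  matrix(rnorm(100 * 10, mean = 0), ncol = 10),
  matrix(rnorm(100 * 10, mean = 3), ncol = 10),
  matrix(rnorm(100 * 10, mean = 6), ncol = 10)
)

# PCA on centred and scaled columns, then keep the first two components
pca <- prcomp(sim, center = TRUE, scale. = TRUE)
pc_scores <- pca$x[, 1:2]

# k-means on the two-dimensional scores; nstart = 10 tries several random starts
km <- kmeans(pc_scores, centers = 3, nstart = 10)

km$size                  # number of rows assigned to each cluster
km$centers               # cluster centroids in PC1/PC2 coordinates
head(km$cluster)         # cluster label for the first few rows
km$betweenss / km$totss  # share of the total sum of squares explained by the clustering

# Scatter plot of the scores coloured by cluster, with the centroids marked
plot(pc_scores, col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
points(km$centers, col = 1:3, pch = 8, cex = 2)
```

With real data, replace `sim` with the descriptor matrix; with 81432 rows the interactive and labelled plots above may be slow or cluttered, so plotting a random sample of rows is one option.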
IRTFM
  • This answer is not helpful since most of us probably don't have `"c:/data.mtx"` sitting on our machines – Señor O Oct 13 '15 at 16:10
  • @SeñorO the question is not helpful since it does not contain a reproducible data set – C8H10N4O2 Oct 13 '15 at 16:15
  • @C8H10N4O2 ok what do you want me to do about that? – Señor O Oct 13 '15 at 16:16
  • @SeñorO I'm guessing you know how to downvote ... :) – C8H10N4O2 Oct 13 '15 at 16:19
  • @C8H10N4O2 so you're asking people to downvote my question, just because I didn't provide a reproducible data set? How can I give a 20MB file here for you? – Victor Leal Oct 13 '15 at 16:21
  • @VictorLeal no offense intended. I just don't think the answer deserves a downvote for not being reproducible -- in fact it is fairly thorough -- since the question isn't reproducible. I didn't downvote anything. I was just saying that if SeñorO was going to downvote something it should be the question not the answer. You could certainly provide sample data, and it is a common practice here. – C8H10N4O2 Oct 13 '15 at 16:27
  • @C8H10N4O2 Right! I agree with you, but I really don't know how to make my question reproducible because the file is kind of large to be uploaded here. And about the algorithm itself, it is just calling R functions... – Victor Leal Oct 13 '15 at 16:32
  • @VictorLeal If you're going to ask people to go through the work of answering your question, go through the work of making the best reproducible data set you can that represents your problem. A lot of times, just in doing that, you will learn more than you will by getting an answer (I say that from personal experience) – Señor O Oct 13 '15 at 16:34

To plot the elbow rule, which is based on how close the points are to their cluster centroids, use tot.withinss (the total within-cluster sum of squares) returned by kmeans for each candidate number of clusters.

This answer applies to R.
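
A minimal sketch of such an elbow plot follows. The simulated matrix `x` (three groups, five columns) and the range `1:10` for the number of clusters are assumptions chosen for illustration, not the original data; only `kmeans` and its `tot.withinss` component come from the answer above.

```r
set.seed(42)  # k-means starts from random centres; fix the seed for repeatability

# Simulated stand-in for the descriptor matrix (assumption: 3 groups, 5 columns)
x <- rbind(
  matrix(rnorm(100 * 5, mean = 0), ncol = 5),
  matrix(rnorm(100 * 5, mean = 4), ncol = 5),
  matrix(rnorm(100 * 5, mean = 8), ncol = 5)
)

# Run k-means for each candidate k and record the total within-cluster sum of squares
ks  <- 1:10
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is the k after which adding clusters stops reducing wss by much
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On a matrix as large as 81432 x 127, each `kmeans` call can take a while, so a smaller `nstart` or a random subset of rows may be a reasonable starting point.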

Victor Leal