I have a document dataset. I converted it to a matrix and ran k-means clustering on it. How do I plot a graph that shows the clusters in the matrix?

k<-5
kmeansResult<-kmeans(m3,k)
plot(m3, col = kmeansResult$cluster)
points(kmeansResult$centers, col = 1:5, pch = 8, cex = 5)

user3456230

1 Answer

As alluded to in the comments, the problem is that your dataset may have more than 2 dimensions (more than 2 variables), whereas a plot is restricted to 2 (or perhaps 3) dimensions, so some sort of dimensionality reduction is needed. The typical way is to run a principal components analysis on the original data and then plot the first two PCs, grouped by cluster. Here are three ways to do this in R, using the mtcars dataset as an example.

df     <- mtcars[,c(1,3,4,5,6,7)]      # subset of mtcars dataset
set.seed(1)                            # for reproducible example
km <- kmeans(df,centers=3)             # k-means, 3 clusters
# using package cluster
library(cluster)
clusplot(df,km$cluster)

# using package ade4
library(ade4)
pca    <- prcomp(df, scale.=TRUE, retx=TRUE)  # principal components analysis
plot.df <- cbind(pca$x[,1], pca$x[,2]) # first and second PC
s.class(plot.df, factor(km$cluster))

# ggplot solution
pca    <- prcomp(df, scale.=TRUE, retx=TRUE)  # principal components analysis
scores <- data.frame(pca$x)                   # PC scores as a data frame (columns PC1, PC2, ...)
# gg: data frame of PC1 and PC2 scores with corresponding cluster
gg <- data.frame(cluster=factor(km$cluster), x=scores$PC1, y=scores$PC2)
# calculate cluster centroid locations
centroids <- aggregate(cbind(x,y)~cluster,data=gg,mean)
# merge centroid locations into ggplot dataframe
gg <- merge(gg,centroids,by="cluster",suffixes=c("",".centroid"))
# calculate 95% confidence ellipses
library(ellipse)
conf.rgn  <- do.call(rbind,lapply(1:3,function(i)
  cbind(cluster=i,ellipse(cov(gg[gg$cluster==i,2:3]),centre=as.matrix(centroids[i,2:3])))))
conf.rgn  <- data.frame(conf.rgn)
conf.rgn$cluster <- factor(conf.rgn$cluster)
# plot cluster map
library(ggplot2)
ggplot(gg, aes(x,y, color=cluster))+
  geom_point(size=3) +
  geom_point(data=centroids, size=4) +
  geom_segment(aes(x=x.centroid, y=y.centroid, xend=x, yend=y))+
  geom_path(data=conf.rgn)

Notice that the three options all give different ellipses! This is because they are defined differently. clusplot(...) by default generates "minimum volume ellipses", which have the correct center and orientation but are sized just large enough to enclose all the points in the cluster. s.class(...) generates ellipses based on a scale factor that can be set in an argument to the call. The ggplot(...) solution generates ellipses which are 95% confidence regions for each cluster (assuming that the points in each cluster follow a bivariate normal distribution). You can see from this that the clusters clearly overlap; that is, several of the points might arguably belong to more than one cluster. This gives a more realistic representation of the data, IMO, which is one of the reasons I prefer it, even though it is clearly more work.
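
To make the "95% confidence region" idea concrete, here is a minimal base-R sketch of how such an ellipse can be built by hand for one cluster. It assumes the usual bivariate-normal construction: scale the unit circle by sqrt(qchisq(0.95, 2)), shape it with the Cholesky factor of the cluster covariance, and shift it to the cluster centroid. As far as I know this matches the defaults of ellipse(...) from the ellipse package, but treat that as an assumption. The names xy, mu, S, r and ell are only illustrative.

```r
# Base-R sketch: 95% ellipse for cluster 1 in the PC1/PC2 plane
df  <- mtcars[, c(1, 3, 4, 5, 6, 7)]   # same subset as above
set.seed(1)
km  <- kmeans(df, centers = 3)
pca <- prcomp(df, scale. = TRUE)
xy  <- pca$x[km$cluster == 1, 1:2]     # PC scores of the points in cluster 1

mu <- colMeans(xy)                     # cluster centroid in PC space
S  <- cov(xy)                          # 2 x 2 sample covariance of the cluster
r  <- sqrt(qchisq(0.95, df = 2))       # radius giving 95% coverage (bivariate normal)

theta  <- seq(0, 2 * pi, length.out = 100)
circle <- cbind(cos(theta), sin(theta))            # points on the unit circle
ell    <- sweep(r * circle %*% chol(S), 2, mu, "+") # scale, shape, then shift

plot(pca$x[, 1:2], col = km$cluster, pch = 16)     # all points, colored by cluster
lines(ell, col = 1)                                # ellipse for cluster 1
```

Every point on ell has the same squared Mahalanobis distance, qchisq(0.95, 2), from the centroid, which is what makes it a 95% region under the normality assumption rather than an ellipse sized to enclose all the points.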

IRTFM
jlhoward
  • I tried your second method and attached the result to the question. All 4 clusters seem to be squeezed together. Is it possible to zoom in? – user3456230 May 12 '14 at 02:56
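
On zooming in: one possible approach is to restrict the axis limits of the plot. Below is a minimal base-R sketch on the mtcars example; the window (the central 90% of the scores on each axis) is an arbitrary, hypothetical choice, and pc, xlim and ylim are just illustrative names. I believe s.class(...) also accepts xlim and ylim arguments, and in ggplot2 the equivalent would be coord_cartesian(xlim=..., ylim=...), but check the documentation for your versions.

```r
# Base-R sketch: zoom into the dense part of the PC1/PC2 plot
df  <- mtcars[, c(1, 3, 4, 5, 6, 7)]
set.seed(1)
km  <- kmeans(df, centers = 3)
pca <- prcomp(df, scale. = TRUE)
pc  <- pca$x[, 1:2]                      # first two PC scores

# Hypothetical zoom window: central 90% of the scores on each axis
xlim <- quantile(pc[, 1], c(0.05, 0.95))
ylim <- quantile(pc[, 2], c(0.05, 0.95))

# Points outside the window are simply not drawn; the data are unchanged
plot(pc, col = km$cluster, pch = 16, xlim = xlim, ylim = ylim)
```

If the clusters are squeezed together because a few extreme points stretch the axes, this kind of window can help; if they overlap even after zooming, the clusters may simply not be well separated in the first two PCs.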