
I am performing k-means clustering with the kmeans function in R, after scaling my data. Once I have the clusters, instead of the individual cluster assignments I want the distance of each point from its cluster center. Here is the code I am using:

data=read.csv("C:/Users/My_Folder/data.csv") # A data frame of 200 rows and 20 variables
traindata=data[,c(3,4)] # Features on which I want to do clustering
traindata=scale(traindata,center = TRUE,scale=TRUE) # Feature scaling
km.result=rep(0,nrow(traindata))
km.cluster = kmeans(traindata, 2,iter.max=20,nstart=25)$cluster
cluster_1_num = sum(km.cluster==1)
cluster_2_num = sum(km.cluster==2)
# Flag the rows of the larger cluster with 1; "else" must follow "}" on the same line
if(cluster_1_num>cluster_2_num){
  km.result[km.cluster==1]=1
} else {
  km.result[km.cluster==2]=1
}
data$cluster=km.result

This code effectively divides my 200 rows into 2 clusters. Instead of labels, is there a way to get the distance of each point from its cluster center? Do I need to rescale my data to the original values?

NG_21
  • How about giving us a small reproducible example to work with? – Roman Luštrik Dec 31 '14 at 09:58
  • @RomanLuštrik, Ok. I have already given the code I am using. Any way where I can give a csv file of my data ? – NG_21 Dec 31 '14 at 10:17
  • Construct a minimal, self contained example. See [this topic](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on tips on how to achieve this. – Roman Luštrik Dec 31 '14 at 12:42

1 Answer


You are keeping only the cluster element of the value returned by kmeans. The same object also contains the cluster centers, in its centers element. Keep the whole object and compute the distances from it:

 #generate some data
 traindata<-matrix(rnorm(400),ncol=2)
 traindata=scale(traindata,center = TRUE,scale=TRUE) # Feature scaling
 #keep the full kmeans result, not just $cluster
 km.cluster = kmeans(traindata, 2,iter.max=20,nstart=25)
 #Euclidean distance between the rows of a two-column matrix p1 and a single point p2 (a 1 x 2 matrix)
 myDist<-function(p1,p2) sqrt((p1[,1]-p2[,1])^2+(p1[,2]-p2[,2])^2)
 #distances of the points in each cluster from that cluster's center (in scaled units)
 myDist(traindata[km.cluster$cluster==1,],km.cluster$centers[1,,drop=FALSE])
 myDist(traindata[km.cluster$cluster==2,],km.cluster$centers[2,,drop=FALSE])

Of course you can write your own myDist function according to your needs.
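
As a possible variation, here is a minimal sketch that returns one distance per row, in the original row order, and works for any number of columns or clusters. The names raw, km, own_center, dist_scaled and dist_raw are made up for this illustration, and the random matrix only stands in for data[,c(3,4)]. The last part is one way to look at the rescaling question: the clustering is done on the scaled features, so the distances are in scaled units unless you transform the centers back with the attributes that scale() stores.

 set.seed(42)                           # only to make this sketch reproducible
 raw <- matrix(rnorm(400), ncol = 2)    # stand-in for data[,c(3,4)]
 traindata <- scale(raw, center = TRUE, scale = TRUE)
 km <- kmeans(traindata, 2, iter.max = 20, nstart = 25)

 # one row per point, holding the center of the cluster that point belongs to
 own_center <- km$centers[km$cluster, , drop = FALSE]

 # Euclidean distance of every point from its own center, in scaled units
 dist_scaled <- sqrt(rowSums((traindata - own_center)^2))

 # optional: the same distances in the original units,
 # undoing the scaling with the attributes stored by scale()
 sc  <- attr(traindata, "scaled:scale")
 ctr <- attr(traindata, "scaled:center")
 center_raw <- sweep(sweep(own_center, 2, sc, "*"), 2, ctr, "+")
 dist_raw <- sqrt(rowSums((raw - center_raw)^2))

With the real data, something like data$dist = dist_scaled would store the distances next to each row. Keep in mind that kmeans minimised the distances in the scaled space, so dist_raw is only a change of units, not what the algorithm optimised.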

nicola